Lecture Notes in Computer Science
Edited by G. Goos, J. Hartmanis and J. van Leeuwen
2130
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Georg Dorffner Horst Bischof Kurt Hornik (Eds.)
Artificial Neural Networks – ICANN 2001
International Conference
Vienna, Austria, August 21–25, 2001
Proceedings
Series Editors

Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors

Georg Dorffner
University of Vienna, Dept. of Medical Cybernetics and Artificial Intelligence
Freyung 6/2, 1010 Vienna, Austria
Email: [email protected]

Horst Bischof
Technical University of Vienna, Institute for Computer Aided Automation,
Pattern Recognition and Image Processing Group
Favoritenstr. 9/1832, 1040 Vienna, Austria
Email: [email protected]

Kurt Hornik
Wirtschaftsuniversität Wien, Institut für Statistik
Augasse 2-6, 1090 Wien, Austria
Email:
[email protected]

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek – CIP-Einheitsaufnahme

Artificial neural networks : international conference ; proceedings / ICANN 2001, Vienna, Austria, August 21-25, 2001. Georg Dorffner ... (ed.). – Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2001
(Lecture notes in computer science ; Vol. 2130)
ISBN 3-540-42486-5

CR Subject Classification (1998): F.1, I.2, I.5, I.4, G.3, J.3, C.2.1, C.1.3

ISSN 0302-9743
ISBN 3-540-42486-5 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2001
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Olgun Computergrafik
Printed on acid-free paper    SPIN 10840062    06/3142    5 4 3 2 1 0
Preface
This book is based on the papers presented at the International Conference on Artificial Neural Networks, ICANN 2001, held August 21–25, 2001 at the Vienna University of Technology, Austria. The conference was organized by the Austrian Research Institute for Artificial Intelligence in cooperation with the Pattern Recognition and Image Processing Group and the Center for Computational Intelligence at the Vienna University of Technology. The ICANN conferences were initiated in 1991 and have become the major European meeting in the field of neural networks.

From about 300 submitted papers, the program committee selected 171 for publication. Each paper was reviewed by three program committee members/reviewers. We would like to thank all the members of the program committee and the reviewers for their great effort in the reviewing process and for helping us to set up a scientific program of high quality. In addition, we invited eight speakers; three of their papers are also included in the proceedings.

We would like to thank the European Neural Network Society (ENNS) for its support. We acknowledge the financial support of Austrian Airlines, the Austrian Science Foundation (FWF) under contract SFB 010, the Austrian Society for Artificial Intelligence (ÖGAI), Bank Austria, and the Vienna Convention Bureau.

We would like to express our sincere thanks to A. Flexer, W. Horn, K. Hraby, F. Leisch, C. Schittenkopf, and A. Weingessel. The conference and the proceedings would not have been possible without their enormous contribution.
Vienna, June 2001
Georg Dorffner
Horst Bischof
Kurt Hornik
Program Co-Chairs ICANN 2001
Organization
ICANN 2001 is organized by the Austrian Research Institute for Artiﬁcial Intelligence in cooperation with the Pattern Recognition and Image Processing Group and the Center for Computational Intelligence at the Vienna University of Technology.
Executive Committee

General Chair: Georg Dorffner, Austria
Program Co-Chairs: Horst Bischof, Austria; Kurt Hornik, Austria
Organizing Committee: Arthur Flexer, Austria; Werner Horn, Austria; Karin Hraby, Austria; Friedrich Leisch, Austria; Andreas Weingessel, Austria
Workshop Chair: Christian Schittenkopf, Austria
Program Committee

Shun-Ichi Amari, Japan
Peter Auer, Austria
Pierre Baldi, USA
Peter Bartlett, Australia
Shai Ben David, Israel
Matthias Budil, Austria
Joachim Buhmann, Germany
Rodney Cotterill, Denmark
Gary Cottrell, USA
Kostas Diamantaras, Greece
Wlodek Duch, Poland
Peter Erdi, Hungary
Patrick Gallinari, France
Wulfram Gerstner, Switzerland
Stan Gielen, The Netherlands
Stefanos Kollias, Greece
Vera Kurkova, Czech Republic
Anders Lansner, Sweden
Aleš Leonardis, Slovenia
Hans-Peter Mallot, Germany
Christoph von der Malsburg, Germany
Klaus-Robert Müller, Germany
Thomas Natschläger, Austria
Lars Niklasson, Sweden
Stefano Nolfi, Italy
Erkki Oja, Finland
Gert Pfurtscheller, Austria
Ulrich Ramacher, Germany
Stephen Roberts, UK
Mariagiovanna Sami, Italy
Jürgen Schmidhuber, Switzerland
Olli Simula, Finland
Peter Sincak, Slovakia
John Taylor, UK
Carme Torras, Spain
Volker Tresp, Germany
Michel Verleysen, Belgium
Additional Referees

Panagiotis Adamidis, Esa Alhoniemi, Gabriela Andrejkova, Matthias Bethge, Leon Bobrowski, Mikael Bodén, Sander Bohte, Roman Borisyuk, Ronald Bormann, Leon Bottou, Raffaele Calabretta, Angelo Cangelosi, Cristiano Cervellera, Eleni Charou, Rizwan Choudrey, Paolo Coletta, Marie Cottrell, Aaron D'Souza, Eric de Bodt, Daniele Denaro, Andrea Di Ferdinando, Markus Diesmann, Hubert Dinse, Sara Dolnicar, José Dorronsoro, Douglas Eck, Julian Eggert, Örjan Ekeberg, Udo Ernst, Christian Eurich, Richard Everson, Sergio Exel, Attila Fazekas, Arthur Flexer, Mikel L. Forcada, Felipe M. G. Franca, Paolo Frasconi, Frederick Garcia, Philippe Gaussier, Apostolos Georgakis, Felix Gers, Marc-Oliver Gewaltig, Zoubin Ghahramani, Rémi Gilleron, Michele Giugliano, Christian Goerick, Mirta Gordon, Bernhard Graimann, Ying Guo, John Hallam, Inman Harvey, Robert Haschke, Rolf Henkel, Tom Heskes, Sepp Hochreiter, Jaakko Hollmen, Anders Holst, Dirk Husmeier, Marcus Hutter, Ari Hämäläinen, Robert Jacobs, Thomas M. Jorgensen, Christian Jutten, Paul Kainen, Samuel Kaski, Yakov Kazanovich, Richard Kempter, Werner M. Kistler, Jens Kohlmorgen, Joost N. Kok, Jeanette H. Kotaleski, Jerzy Korczak, Constantine Kotropoulos, Stanislav Kovacic, Stefan Kozak, Brigitte Krenn, Malgorzata Kretowska, Norbert Krüger, Ben Kröse, Vlado Kvasnicka, Ivo Kwee, Peter König, Jorma Laaksonen, Krista Lagus, Wee Sun Lee, Charles-Albert Lehalle, Fritz Leisch, Uros Lotric, Dominique Martinez, Peter Meinicke, Thomas Melzer, Risto Miikkulainen, Sebastian Mika, José del R. Millán, Igor Mokris, Noboru Murata, Jean-Pierre Nadal, Hiroyuki Nakahara, Ralph Neuneier, Athanasios Nikolaidis, Nikos Nikolaidis, Klaus Obermayer, Vladimir Olej, Luigi Pagliarini, Domenico Parisi, Olivier Teytaud, Hélène Paugam-Moisy, Stavros Perantonis, Conrad Perez, Markus Peura, Frederic Piat, Fernando Pineda, Gianluca Pollastri, Gunnar Rätsch, Kimmo Raivio, Erhard Rank, Carl Edward Rasmussen, Iead Rezek, Carlos Ribeiro, Helge Ritter, Tobias Rodemann, Raul Rojas, Agostinho Rosa, Volker Roth, Ulrich Rückert, Vicente Ruiz de Angulo, Pal Rujan, David Saad, Mariagiovanna Sami, Marcello Sanguineti, Petr Savicky, Christian Schittenkopf,
Michael Schmitt, Bernhard Schölkopf, Nicol Schraudolph, René Schüffny, Walter Senn, Terezie Sidlofova, Jiří Šíma, Zenon Sosnowski, Andreas Stafylopatis, Arnost Stedry, Branko Ster, Volker Steuber, Petr Stepan, Piotr Suffczynski, Johan Suykens, Peter Sykacek, Anastasios Tefas, Andreas Thiel, Peter Tino, Marc Tommasi, Mira Trebar, Edmondo Trentin, Sofia Tsekeridou, Koji Tsuda, Panagiotis Tzionas, Shiro Usui, Nikolaos Vassilas, Juha Vesanto, Ricardo Vigario, Sethu Vijayakumar, Alessandro Villa, Nikos Vlassis, Alpo Värri, Andreas Weingessel, Heiko Wersing, Nicholas Wickström, Wim Wiegerinck, Stefan Wilke, Laurenz Wiskott, Rolf P. Würtz, Tony Zador, Tom Ziemke
Sponsoring Institutions

– Austrian Airlines
– Austrian Research Institute for Artificial Intelligence
– Austrian Science Foundation (FWF) under the contract SFB 010
– Austrian Society for Artificial Intelligence (ÖGAI)
– Bank Austria
– Vienna Convention Bureau
– Vienna University of Technology
Table of Contents
Invited Papers

The Complementary Brain . . . . . . . . . . 3
Stephen Grossberg

Neural Networks for Adaptive Processing of Structured Data . . . . . . . . . . 5
Alessandro Sperduti

Bad Design and Good Performance: Strategies of the Visual System
for Enhanced Scene Analysis . . . . . . . . . . 13
Florentin Wörgötter
Data Analysis and Pattern Recognition

Fast Curvature Matrix-Vector Products . . . . . . . . . . 19
Nicol N. Schraudolph

Architecture Selection in NLDA Networks . . . . . . . . . . 27
José R. Dorronsoro, Ana M. González, and Carlos Santa Cruz

Neural Learning Invariant to Network Size Changes . . . . . . . . . . 33
Vicente Ruiz de Angulo and Carme Torras

Boosting Mixture Models for Semi-supervised Learning . . . . . . . . . . 41
Yves Grandvalet, Florence d'Alché-Buc, and Christophe Ambroise

Bagging Can Stabilize without Reducing Variance . . . . . . . . . . 49
Yves Grandvalet

Symbolic Prosody Modeling by Causal Retrocausal NNs
with Variable Context Length . . . . . . . . . . 57
Achim F. Müller and Hans Georg Zimmermann

Discriminative Dimensionality Reduction Based on Generalized LVQ . . . . . . . . . . 65
Atsushi Sato

A Computational Intelligence Approach to Optimization
with Unknown Objective Functions . . . . . . . . . . 73
Hirotaka Nakayama, Masao Arakawa, and Rie Sasaki

Clustering Gene Expression Data by Mutual Information
with Gene Function . . . . . . . . . . 81
Samuel Kaski, Janne Sinkkonen, and Janne Nikkilä
Learning to Learn Using Gradient Descent . . . . . . . . . . 87
Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell

A Variational Approach to Robust Regression . . . . . . . . . . 95
Anita C. Faul and Michael E. Tipping

Minimum-Entropy Data Clustering Using Reversible Jump
Markov Chain Monte Carlo . . . . . . . . . . 103
Stephen J. Roberts, Christopher Holmes, and David Denison

Behavioral Market Segmentation of Binary Guest Survey Data
with Bagged Clustering . . . . . . . . . . 111
Sara Dolničar and Friedrich Leisch

Direct Estimation of Polynomial Densities in Generalized RBF Networks
Using Moments . . . . . . . . . . 119
Evangelos Dermatas

Generalisation Improvement of Radial Basis Function Networks Based on
Qualitative Input Conditioning for Financial Credit Risk Prediction . . . . . . . . . . 127
Xavier Parra, Núria Agell, and Xari Rovira

Approximation of Bayesian Discriminant Function by Neural Networks
in Terms of Kullback-Leibler Information . . . . . . . . . . 135
Yoshifusa Ito and Cidambi Srinivasan

The Bias-Variance Dilemma of the Monte Carlo Method . . . . . . . . . . 141
Zlochin Mark and Yoram Baram

A Markov Chain Monte Carlo Algorithm for the Quadratic Assignment
Problem Based on Replicator Equations . . . . . . . . . . 148
Takehiro Nishiyama, Kazuo Tsuchiya, and Katsuyoshi Tsujita

Mapping Correlation Matrix Memory Applications
onto a Beowulf Cluster . . . . . . . . . . 156
Michael Weeks, Jim Austin, Anthony Moulds, Aaron Turner,
Zygmunt Ulanowski, and Julian Young

Accelerating RBF Network Simulation by Using Multimedia Extensions
of Modern Microprocessors . . . . . . . . . . 164
Alfred Strey and Martin Bange

A Game-Theoretic Adaptive Categorization Mechanism
for ART-Type Networks . . . . . . . . . . 170
Wai-keung Fung and Yun-hui Liu

Gaussian Radial Basis Functions and Inner-Product Spaces . . . . . . . . . . 177
Irwin W. Sandberg
Mixture of Probabilistic Factor Analysis Model and Its Applications . . . . . . . . . . 183
Masahiro Tanaka

Deferring the Learning for Better Generalization
in Radial Basis Neural Networks . . . . . . . . . . 189
José María Valls, Pedro Isasi, and Inés María Galván

Improvement of Cluster Detection and Labeling Neural Network
by Introducing Elliptical Basis Function . . . . . . . . . . 196
Christophe Lurette and Stéphane Lecoeuche

Independent Variable Group Analysis . . . . . . . . . . 203
Krista Lagus, Esa Alhoniemi, and Harri Valpola

Weight Quantization for Multilayer Perceptrons
Using Soft Weight Sharing . . . . . . . . . . 211
Fatih Köksal, Ethem Alpaydın, and Günhan Dündar

Voting-Merging: An Ensemble Method for Clustering . . . . . . . . . . 217
Evgenia Dimitriadou, Andreas Weingessel, and Kurt Hornik

The Application of Fuzzy ARTMAP in the Detection
of Computer Network Attacks . . . . . . . . . . 225
James Cannady and Raymond C. Garcia

Transductive Learning: Learning Iris Dataset with Two Labeled Data . . . . . . . . . . 231
Chun Hung Li and Pong Chi Yuen

Approximation of Time-Varying Functions with Local Regression Models . . . . . . . . . . 237
Achim Lewandowski and Peter Protzel
Theory

Complexity of Learning for Networks of Spiking Neurons
with Nonlinear Synaptic Interactions . . . . . . . . . . 247
Michael Schmitt

Product Unit Neural Networks with Constant Depth
and Superlinear VC Dimension . . . . . . . . . . 253
Michael Schmitt

Generalization Performances of Perceptrons . . . . . . . . . . 259
Gérald Gavin

Bounds on the Generalization Ability of Bayesian Inference
and Gibbs Algorithms . . . . . . . . . . 265
Olivier Teytaud and Hélène Paugam-Moisy
Learning Curves for Gaussian Processes Models:
Fluctuations and Universality . . . . . . . . . . 271
Dörthe Malzahn and Manfred Opper

Tight Bounds on Rates of Neural-Network Approximation . . . . . . . . . . 277
Věra Kůrková and Marcello Sanguineti
Kernel Methods

Scalable Kernel Systems . . . . . . . . . . 285
Volker Tresp and Anton Schwaighofer

On-Line Learning Methods for Gaussian Processes . . . . . . . . . . 292
Shigeyuki Oba, Masa-aki Sato, and Shin Ishii

Online Approximations for Wind-Field Models . . . . . . . . . . 300
Lehel Csató, Dan Cornford, and Manfred Opper

Fast Training of Support Vector Machines by Extracting Boundary Data . . . . . . . . . . 308
Shigeo Abe and Takuya Inoue

Multiclass Classification with Pairwise Coupled Neural Networks
or Support Vector Machines . . . . . . . . . . 314
Eddy Nicolas Mayoraz

Incremental Support Vector Machine Learning: A Local Approach . . . . . . . . . . 322
Liva Ralaivola and Florence d'Alché-Buc

Learning to Predict the Leave-One-Out Error of Kernel Based Classifiers . . . . . . . . . . 331
Koji Tsuda, Gunnar Rätsch, Sebastian Mika, and Klaus-Robert Müller

Sparse Kernel Regressors . . . . . . . . . . 339
Volker Roth

Learning on Graphs in the Game of Go . . . . . . . . . . 347
Thore Graepel, Mike Goutrié, Marco Krüger, and Ralf Herbrich

Nonlinear Feature Extraction
Using Generalized Canonical Correlation Analysis . . . . . . . . . . 353
Thomas Melzer, Michael Reiter, and Horst Bischof

Gaussian Process Approach to Stochastic Spiking Neurons with Reset . . . . . . . . . . 361
Ken-ichi Amemori and Shin Ishii

Kernel Based Image Classification . . . . . . . . . . 369
Olivier Teytaud and David Sarrut

Gaussian Processes for Model Fusion . . . . . . . . . . 376
Mohammed A. El-Beltagy and W. Andy Wright
Kernel Canonical Correlation Analysis
and Least Squares Support Vector Machines . . . . . . . . . . 384
Tony Van Gestel, Johan A.K. Suykens, Jos De Brabanter, Bart De Moor,
and Joos Vandewalle

Learning and Prediction of the Nonlinear Dynamics of Biological Neurons
with Support Vector Machines . . . . . . . . . . 390
Thomas Frontzek, Thomas Navin Lal, and Rolf Eckmiller

Close-Class-Set Discrimination Method for Recognition of Stop Consonant
Vowel Utterances Using Support Vector Machines . . . . . . . . . . 399
Chellu Chandra Sekhar, Kazuya Takeda, and Fumitada Itakura

Linear Dependency between ε and the Input Noise
in ε-Support Vector Regression . . . . . . . . . . 405
James T. Kwok

The Bayesian Committee Support Vector Machine . . . . . . . . . . 411
Anton Schwaighofer and Volker Tresp
Topographic Mapping

Using Directional Curvatures to Visualize Folding Patterns
of the GTM Projection Manifolds . . . . . . . . . . 421
Peter Tiňo, Ian Nabney, and Yi Sun

Self Organizing Map and Sammon Mapping for Asymmetric Proximities . . . . . . . . . . 429
Manuel Martín-Merino and Alberto Muñoz

Active Learning with Adaptive Grids . . . . . . . . . . 436
Michele Milano, Jürgen Schmidhuber, and Petros Koumoutsakos

Complex Process Visualization through Continuous Feature Maps
Using Radial Basis Functions . . . . . . . . . . 443
Ignacio Díaz, Alberto B. Diez, and Abel A. Cuadrado Vega

A Soft k-Segments Algorithm for Principal Curves . . . . . . . . . . 450
Jakob J. Verbeek, Nikos Vlassis, and Ben Kröse

Product Positioning Using Principles from the Self-Organizing Map . . . . . . . . . . 457
Chris Charalambous, George C. Hadjinicola, and Eitan Muller

Combining the Self-Organizing Map and K-Means Clustering
for On-Line Classification of Sensor Data . . . . . . . . . . 464
Kristof Van Laerhoven

Histogram Based Color Reduction
through Self-Organized Neural Networks . . . . . . . . . . 470
Antonios Atsalakis, Ioannis Andreadis, and Nikos Papamarkos
Sequential Learning for SOM Associative Memory
with Map Reconstruction . . . . . . . . . . 477
Motonobu Hattori, Hiroya Arisumi, and Hiroshi Ito

Neighborhood Preservation in Nonlinear Projection Methods:
An Experimental Study . . . . . . . . . . 485
Jarkko Venna and Samuel Kaski

A Topological Hierarchical Clustering:
Application to Ocean Color Classification . . . . . . . . . . 492
Méziane Yacoub, Fouad Badran, and Sylvie Thiria

Hierarchical Clustering of Document Archives
with the Growing Hierarchical Self-Organizing Map . . . . . . . . . . 500
Michael Dittenbach, Dieter Merkl, and Andreas Rauber
Independent Component Analysis

Blind Source Separation of Single Components from Linear Mixtures . . . . . . . . . . 509
Roland Vollgraf, Ingo Schießl, and Klaus Obermayer

Blind Source Separation Using Principal Component Neural Networks . . . . . . . . . . 515
Konstantinos I. Diamantaras

Blind Separation of Sources by Differentiating the Output Cumulants
and Using Newton's Method . . . . . . . . . . 521
Rubén Martín-Clemente, José I. Acha, and Carlos G. Puntonet

Mixtures of Independent Component Analysers . . . . . . . . . . 527
Stephen J. Roberts and William D. Penny

Conditionally Independent Component Extraction
for Naive Bayes Inference . . . . . . . . . . 535
Shotaro Akaho

Fast Score Function Estimation with Application in ICA . . . . . . . . . . 541
Nikos Vlassis

Health Monitoring with Learning Methods . . . . . . . . . . 547
Alexander Ypma, Co Melissant, Ole Baunbæk Jensen, and Robert P.W. Duin

Breast Tissue Classification in Mammograms Using ICA Mixture Models . . . . . . . . . . 554
Ioanna Christoyianni, Athanasios Koutras, Evangelos Dermatas,
and George Kokkinakis

Neural Network Based Blind Source Separation of Nonlinear Mixtures . . . . . . . . . . 561
Athanasios Koutras, Evangelos Dermatas, and George Kokkinakis
Feature Extraction Using ICA . . . . . . . . . . 568
Nojun Kwak, Chong-Ho Choi, and Jin Young Choi
Signal Processing

Continuous Speech Recognition
with a Robust Connectionist/Markovian Hybrid Model . . . . . . . . . . 577
Edmondo Trentin and Marco Gori

Faster Convergence and Improved Performance in Least-Squares Training
of Neural Networks for Active Sound Cancellation . . . . . . . . . . 583
Martin Bouchard

Bayesian Independent Component Analysis as Applied
to One-Channel Speech Enhancement . . . . . . . . . . 593
Ilyas Potamitis, Nikos Fakotakis, and George Kokkinakis

Massively Parallel Classification of EEG Signals
Using Min-Max Modular Neural Networks . . . . . . . . . . 601
Bao-Liang Lu, Jonghan Shin, and Michinori Ichikawa

Single Trial Estimation of Evoked Potentials Using Gaussian Mixture
Models with Integrated Noise Component . . . . . . . . . . 609
Arthur Flexer, Herbert Bauer, Claus Lamm, and Georg Dorffner

A Probabilistic Approach to High-Resolution Sleep Analysis . . . . . . . . . . 617
Peter Sykacek, Stephen Roberts, Iead Rezek, Arthur Flexer,
and Georg Dorffner

Comparison of Wavelet Thresholding Methods
for Denoising ECG Signals . . . . . . . . . . 625
Vladimir Cherkassky and Steven Kilts

Evoked Potential Signal Estimation
Using Gaussian Radial Basis Function Network . . . . . . . . . . 630
G. Sita and A.G. Ramakrishnan

'Virtual Keyboard' Controlled by Spontaneous EEG Activity . . . . . . . . . . 636
Bernhard Obermaier, Gernot Müller, and Gert Pfurtscheller

Clustering of EEG-Segments Using Hierarchical Agglomerative Methods
and Self-Organizing Maps . . . . . . . . . . 642
David Sommer and Martin Golz

Nonlinear Signal Processing for Noise Reduction
of Unaveraged Single Channel MEG Data . . . . . . . . . . 650
Wei Lee Woon and David Lowe
Time Series Processing

A Discrete Probabilistic Memory Model
for Discovering Dependencies in Time . . . . . . . . . . 661
Sepp Hochreiter and Michael C. Mozer

Applying LSTM to Time Series Predictable
through Time-Window Approaches . . . . . . . . . . 669
Felix A. Gers, Douglas Eck, and Jürgen Schmidhuber

Generalized Relevance LVQ for Time Series . . . . . . . . . . 677
Marc Strickert, Thorsten Bojer, and Barbara Hammer

Unsupervised Learning in LSTM Recurrent Neural Networks . . . . . . . . . . 684
Magdalena Klapper-Rybicka, Nicol N. Schraudolph,
and Jürgen Schmidhuber

Applying Kernel Based Subspace Classification
to a Nonintrusive Monitoring for Household Electric Appliances . . . . . . . . . . 692
Hiroshi Murata and Takashi Onoda

Neural Networks in Circuit Simulators . . . . . . . . . . 699
Alessio Plebe, A. Marcello Anile, and Salvatore Rinaudo

Neural Networks Ensemble for Cyclosporine Concentration Monitoring . . . . . . . . . . 706
Gustavo Camps, Emilio Soria, José D. Martín, Antonio J. Serrano,
Juan J. Ruixo, and N. Víctor Jiménez

Efficient Hybrid Neural Network for Chaotic Time Series Prediction . . . . . . . . . . 712
Hirotaka Inoue, Yoshinobu Fukunaga, and Hiroyuki Narihisa

Online Symbolic-Sequence Prediction
with Discrete-Time Recurrent Neural Networks . . . . . . . . . . 719
Juan Antonio Pérez-Ortiz, Jorge Calera-Rubio, and Mikel L. Forcada

Prediction Systems Based on FIR BP Neural Networks . . . . . . . . . . 725
Stanislav Kaleta, Daniel Novotný, and Peter Sinčák

On the Generalization Ability of Recurrent Networks . . . . . . . . . . 731
Barbara Hammer

Finite-State Reber Automaton and the Recurrent Neural Networks
Trained in Supervised and Unsupervised Manner . . . . . . . . . . 737
Michal Čerňanský and Ľubica Beňušková

Estimation of Computational Complexity of Sensor Accuracy
Improvement Algorithm Based on Neural Networks . . . . . . . . . . 743
Volodymyr Turchenko, Volodymyr Kochan, and Anatoly Sachenko
Fusion Architectures for the Classification of Time Series . . . . . . . . . . 749
Christian Dietrich, Friedhelm Schwenker, and Günther Palm
Special Session: Agent-Based Economic Modeling

The Importance of Representing Cognitive Processes
in Multi-agent Models . . . . . . . . . . 759
Bruce Edmonds and Scott Moss

Multi-agent FX-Market Modeling Based on Cognitive Systems . . . . . . . . . . 767
Georg Zimmermann, Ralph Neuneier, and Ralph Grothmann

Speculative Dynamics in a Heterogeneous-Agent Model . . . . . . . . . . 775
Taisei Kaizoji

Nonlinear Adaptive Beliefs and the Dynamics of Financial Markets:
The Role of the Evolutionary Fitness Measure . . . . . . . . . . 782
Andrea Gaunersdorfer and Cars H. Hommes

Analyzing Purchase Data by a Neural Net Extension
of the Multinomial Logit Model . . . . . . . . . . 790
Harald Hruschka, Werner Fettes, and Markus Probst
Self-Organization and Dynamical Systems

Using Maximal Recurrence
in Linear Threshold Competitive Layer Networks . . . . . . . . . . 799
Heiko Wersing and Helge Ritter

Exponential Transients in Continuous-Time Symmetric Hopfield Nets . . . . . . . . . . 806
Jiří Šíma and Pekka Orponen

Initial Evolution Results on CAM-Brain Machines (CBMs) . . . . . . . . . . 814
Hugo de Garis, Andrzej Buller, Leo de Penning, Tomasz Chodakowski,
and Derek Decesare

Self-Organizing Topology Evolution of Turing Neural Networks . . . . . . . . . . 820
Christof Teuscher and Eduardo Sanchez

Efficient Pattern Discrimination with Inhibitory WTA Nets . . . . . . . . . . 827
Brijnesh J. Jain and Fritz Wysotzki

Cooperative Information Control to Coordinate Competition
and Cooperation . . . . . . . . . . 835
Ryotaro Kamimura and Taeko Kamimura

Qualitative Analysis
of Continuous Complex-Valued Associative Memories . . . . . . . . . . 843
Yasuaki Kuroe, Naoki Hashimoto, and Takehiro Mori
Self Organized Partitioning of Chaotic Attractors for Control . . . . . . . . . . 851
Nils Goerke, Florian Kintzler, and Rolf Eckmiller

A Generalisable Measure of Self-Organisation and Emergence . . . . . . . . . . 857
W. Andy Wright, Robert E. Smith, Martin Danek, and Phillip Greenway

Market-Based Reinforcement Learning in Partially Observable Worlds . . . . . . . . . . 865
Ivo Kwee, Marcus Hutter, and Jürgen Schmidhuber

Sequential Strategy for Learning
Multi-stage Multi-agent Collaborative Games . . . . . . . . . . 874
W. Andy Wright
Robotics and Control

Neural Architecture for Mental Imaging of Sequences
Based on Optical Flow Predictions . . . . . . . . . . 885
Volker Stephan and Horst-Michael Gross

Visual Checking of Grasping Positions of a Three-Fingered Robot Hand . . . . . . . . . . 891
Gunther Heidemann and Helge Ritter

Anticipation-Based Control Architecture for a Mobile Robot . . . . . . . . . . 899
Andrea Heinze and Horst-Michael Gross

Neural Adaptive Force Control for Compliant Robots . . . . . . . . . . 906
N. Saadia, Y. Amirat, J. Pontnaut, and A. Ramdane-Cherif

A Design of Neural-Net Based Self-Tuning PID Controllers . . . . . . . . . . 914
Michiyo Suzuki, Toru Yamamoto, Kazuo Kawada, and Hiroyuki Sogo

Kinematic Control and Obstacle Avoidance for Redundant Manipulators
Using a Recurrent Neural Network . . . . . . . . . . 922
Wai Sum Tang, Cherry Miu Ling Lam, and Jun Wang

Adaptive Neural Control of Nonlinear Systems . . . . . . . . . . 930
Ieroham Baruch, Jose Martin Flores, Federico Thomas, and Ruben Garrido

A Hierarchical Method
for Training Embedded Sigmoidal Neural Networks . . . . . . . . . . 937
Jinglu Hu and Kotaro Hirasawa

Towards Learning Path Planning for Solving Complex Robot Tasks . . . . . . . . . . 943
Thomas Frontzek, Thomas Navin Lal, and Rolf Eckmiller

Hammerstein Model Identification
Using Radial Basis Functions Neural Networks . . . . . . . . . . 951
Hussain N. Al-Duwaish and Syed Saad Azhar Ali
Evolving Neural Behaviour Control for Autonomous Robots . . . . . . . . . . 957
Martin Hülse, Bruno Lara, Frank Pasemann, and Ulrich Steinmetz

Construction by Autonomous Agents in a Simulated Environment . . . . . . . . . . 963
Anand Panangadan and Michael G. Dyer

A Neural Control Model Using Predictive Adjustment Mechanism
of Viscoelastic Property of the Human Arm . . . . . . . . . . 971
Masazumi Katayama

Multijoint Arm Trajectory Formation Based on the Minimization
Principle Using the Euler-Poisson Equation . . . . . . . . . . 977
Yasuhiro Wada, Yuichi Kaneko, Eri Nakano, Rieko Osu,
and Mitsuo Kawato
Vision and Image Processing

Neocognitron of a New Version: Handwritten Digit Recognition . . . . . . . . . . 987
Kunihiko Fukushima

A Comparison of Classifiers for Real-Time Eye Detection . . . . . . . . . . 993
Alex Cozzi, Myron Flickner, Jianchang Mao, and Shivakumar Vaithyanathan

Neural Network Analysis
of Dynamic Contrast-Enhanced MRI Mammography . . . . . . . . . . 1000
Axel Wismüller, Oliver Lange, Dominik R. Dersch, Klaus Hahn,
and Gerda L. Leinsinger

A New Adaptive Color Quantization Technique . . . . . . . . . . 1006
Antonios Atsalakis, Nikos Papamarkos, and Charalambos Strouthopoulos

Tunable Oscillatory Network for Visual Image Segmentation . . . . . . . . . . 1013
Margarita G. Kuzmina, Eduard A. Manykin, and Irina I. Surina

Detecting Shot Transitions for Video Indexing with FAM . . . . . . . . . . 1020
Seok-Woo Jang, Gye-Young Kim, and Hyung-Il Choi

Finding Faces in Cluttered Still Images with Few Examples . . . . . . . . . . 1026
Jan Wieghardt and Hartmut S. Loos

Description of Dynamic Structured Scenes
by a SOM/ARSOM Hierarchy . . . . . . . . . . 1034
Antonio Chella, Maria Donatella Guarino, and Roberto Pirrone

Evaluation of Distance Measures for Partial Image Retrieval
Using Self-Organising Map . . . . . . . . . . 1042
Yin Huang, Ponnuthurai N. Suganthan, Shankar M. Krishnan,
and Xiang Cao
Video Sequence Boundary Detection Using Neural Gas Networks . . . 1048 Xiang Cao and Ponnuthurai N. Suganthan A Neural-Network-Based Approach to Adaptive Human Computer Interaction . . . 1054 George Votsis, Nikolaos D. Doulamis, Anastasios D. Doulamis, Nicolas Tsapatsoulis, and Stefanos D. Kollias Adaptable Neural Networks for Unsupervised Video Object Segmentation of Stereoscopic Sequences . . . 1060 Anastasios D. Doulamis, Klimis S. Ntalianis, Nikolaos D. Doulamis, and Stefanos D. Kollias
Computational Neuroscience
A Model of Border-Ownership Coding in Early Vision . . . 1069 Masayuki Kikuchi and Youhei Akashi Extracting Slow Subspaces from Natural Videos Leads to Complex Cells . . . 1075 Christoph Kayser, Wolfgang Einhäuser, Olaf Dümmer, Peter König, and Konrad Körding Neural Coding of Dynamic Stimuli . . . 1081 Stefan D. Wilke Resonance of a Stochastic Spiking Neuron Mimicking the Hodgkin-Huxley Model . . . 1087 Kenichi Amemori and Shin Ishii Spike and Burst Synchronization in a Detailed Cortical Network Model with IF Neurons . . . 1095 Baran Çürüklü and Anders Lansner Using Depressing Synapses for Phase Locked Auditory Onset Detection . . . 1103 Leslie S. Smith Controlling Oscillatory Behaviour of a Two Neuron Recurrent Neural Network Using Inputs . . . 1109 Robert Haschke, Jochen J. Steil, and Helge Ritter Temporal Hebbian Learning in Rate-Coded Neural Networks: A Theoretical Approach towards Classical Conditioning . . . 1115 Bernd Porr and Florentin Wörgötter A Mathematical Analysis of a Correlation Based Model for the Orientation Map Formation . . . 1121 Tadashi Yamazaki Learning from Chaos: A Model of Dynamical Perception . . . 1129 Emmanuel Daucé
Episodic Memory and Cognitive Map in a Rate Model Network of the Rat Hippocampus . . . 1135 Fanni Misják, Máté Lengyel, and Péter Érdi A Model of Horizontal 360◦ Object Localization Based on Binaural Hearing and Monocular Vision . . . 1141 Carsten Schauer and Horst-Michael Gross Self-Organization of Orientation Maps, Lateral Connections, and Dynamic Receptive Fields in the Primary Visual Cortex . . . 1147 Cornelius Weber Markov Chain Model Approximating the Hodgkin-Huxley Neuron . . . 1153 Yuichi Sakumura, Norio Konno, and Kazuyuki Aihara
Connectionist Cognitive Science
A Neural Oscillator Model of Auditory Attention . . . 1163 Stuart N. Wrigley and Guy J. Brown Coupled Neural Maps for the Origins of Vowel Systems . . . 1171 Pierre-Yves Oudeyer Learning for Text Summarization Using Labeled and Unlabeled Sentences . . . 1177 Massih-Reza Amini and Patrick Gallinari On-Line Error Detection of Annotated Corpus Using Modular Neural Networks . . . 1185 Qing Ma, Bao-Liang Lu, Masaki Murata, Michinori Ichikawa, and Hitoshi Isahara Instance-Based Method to Extract Rules from Neural Networks . . . 1193 Dae-Eun Kim and Jaeho Lee A Novel Binary Spell Checker . . . 1199 Victoria J. Hodge and Jim Austin Neural Nets for Short Movements in Natural Language Processing . . . 1205 Neill Taylor and John Taylor Using Document Features to Optimize Web Cache . . . 1211 Timo Koskela, Jukka Heikkonen, and Kimmo Kaski Generation of Diversiform Characters Using a Computational Handwriting Model and a Genetic Algorithm . . . 1217 Yasuhiro Wada, Kei Ohkawa, and Keiichi Sumita Information Maximization and Language Acquisition . . . 1225 Ryotaro Kamimura and Taeko Kamimura
A Mirror Neuron System for Syntax Acquisition . . . . . . . . . . . . . . . . . . . . . . 1233 Steve Womble and Stefan Wermter A Network of Relaxation Oscillators that Finds Downbeats in Rhythms . . 1239 Douglas Eck Knowledge Incorporation and Rule Extraction in Neural Networks . . . . . . 1248 Minoru Fukumi, Yasue Mitsukura, and Norio Akamatsu Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1255
Invited Papers
The Complementary Brain (Abstract) Stephen Grossberg Center for Adaptive Systems and Department of Cognitive and Neural Systems Boston University 677 Beacon Street, Boston, MA 02215, USA
[email protected] How are our brains functionally organized to achieve adaptive behavior in a changing world? This talk will survey evidence supporting a computational paradigm that radically departs from the computer metaphor suggesting that brains are organized into independent modules. Evidence is reviewed that brains are organized into parallel processing streams with complementary properties. This perspective clarifies, for example, how parallel processing in the brain leads to visual percepts and object recognition, and how perceptual mechanisms differ from those of movement control. Multiple modeling studies predict how the neocortex is organized into parallel processing streams such that pairs of streams obey complementary computational rules (as when two puzzle pieces fit together). Hierarchical interactions within each stream and parallel interactions between streams create coherent behavioral representations that overcome the complementary deficiencies of each stream and support unitary conscious experiences. For example, visual boundaries and surfaces seem to obey complementary rules in the Interblob and Blob streams from area V1 to V4. They interact to generate visible representations of 3D surfaces in which partially occluded objects are mutually separated and completed. Visual boundaries and motion in the Interblob and Magnocellular cortical processing streams seem to obey complementary computational rules in cortical areas V2 and MT. They interactively form representations of object motion in depth. Predictive target tracking (e.g., targets moving relative to a stationary observer) and optic flow navigation (e.g., an observer moving relative to its world) seem to obey complementary computational rules in ventral and dorsal MST. These regions interact to track moving targets, and to determine an observer's heading and time-to-contact.
Spatially invariant object recognition seems to obey complementary computational rules as compared to spatial representation and the control of action in Inferotemporal and Parietal cortex, where they, respectively, learn to stably recognize objects in a changing world, and rapidly relearn spatial and action parameters when motor parameters change. Their interaction enables spatially invariant object recognition categories of valued objects (in the What stream) to direct spatial attention and actions (in the Where stream) towards these objects in space. These results help to quantitatively simulate many data and to make surprising predictions. They provide a more global perspective on how the brain controls
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 3-4, 2001. © Springer-Verlag Berlin Heidelberg 2001
behavior, and on what the brain's computational units with behavioral significance are, including conscious perceptual representations. In particular, the role of synchronous processing in binding complementary properties together, and how it may persist, collapse, and be reset in temporal cycles of coherent and incoherent processing, will be noted. These insights also provide new designs for technological applications wherein adaptive autonomous control in uncertain and unpredictable environments is needed.
Acknowledgement. Supported in part by AFOSR, DARPA, NSF, and ONR.
References
Grossberg 2000. Grossberg S., The Complementary Brain: Unifying Brain Dynamics and Modularity, Trends in Cognitive Sciences, 4, 233-245, 2000.
Neural Networks for Adaptive Processing of Structured Data Alessandro Sperduti Dip. di Informatica, Università di Pisa, Corso Italia, 40, 56125 Pisa, Italy
Abstract. Structured domains are characterized by complex patterns which are usually represented as lists, trees, and graphs of variable sizes and complexity. The ability to recognize and classify these patterns is fundamental for several applications that use, generate, or manipulate structures. In this paper I review some of the concepts underpinning Recursive Neural Networks, i.e., neural network models able to deal with data represented as directed acyclic graphs.
1 Introduction
The processing of structured data is usually confined to the domain of symbolic systems. Recently, however, there has been some effort in trying to extend the computational capabilities of neural networks to structured domains. While earlier neural approaches were able to deal with some aspects of the processing of structured information, none of them established a practical and efficient way of dealing with structured information. A more powerful approach, at least for classification and prediction tasks, was proposed in [15] and further extended in [7]. In these works, Recursive Neural Networks, a generalization of recurrent neural networks for processing sequences to the case of directed graphs, were defined. These models are able to learn a mapping from a domain of ordered (or positional) directed acyclic graphs, with labels attached to each node, to the set of real numbers. The basic idea behind the models is the extension of the concept of unfolding from the domain of sequences to the domain of directed ordered graphs (DOGs). In this paper, I briefly present some of the basic concepts underpinning Recursive Neural Networks. Some supervised and unsupervised models are presented, together with an outlook on the main computational, complexity, and learnability results obtained up to now. The possibility of processing structured information using neural networks is appealing for several reasons. First of all, neural networks are universal approximators; in addition, they are able to learn from a set of examples and, very often, by using the correct methodology for training, they are able to reach a quite high generalization performance. Finally, they are able to deal with noise and incomplete, or even ambiguous, data. All these capabilities are particularly useful when dealing with prediction tasks where data are usually gathered experimentally, and thus are partial, noisy, and incomplete. A typical example of
such a domain is chemistry, where compounds can naturally be represented as labeled graphs.
2 Data Structures and Notation
In this paper we assume that instances in the learning domain are DOAGs (directed ordered acyclic graphs) or DPAGs (directed positional acyclic graphs). A DOAG is a DAG D with vertex set vert(D) and edge set edg(D), where for each vertex v ∈ vert(D) a total order on the edges leaving from v is defined. DPAGs are a superclass of DOAGs in which it is assumed that for each vertex v a bijection P: edg(D) → ℕ is defined on the edges leaving from v. The indegree of node v is the number of incoming edges to v, whereas the outdegree of v is the number of outgoing edges from v. We shall require the DAG (either DOAG or DPAG) to possess a supersource¹, i.e., a vertex s ∈ vert(D) such that every vertex in vert(D) can be reached by a directed path starting from s. Given a DAG D and v ∈ vert(D), we denote by ch[v] the set of children of v, and by ch_k[v] the k-th child of v. We shall use lowercase bold letters to denote vectors, uppercase bold letters to denote matrices, and calligraphic letters to represent graphs. A data structure Y is a DAG whose vertices are labeled by vectors of real-valued numbers which represent either numerical or categorical variables. Subscript notation will be used when referencing the labels attached to vertices in a data structure; hence y_v denotes the vector of variables labeling vertex v ∈ vert(Y). In the following, we shall denote by #^(i,c) the class of DAGs with maximum indegree i and maximum outdegree c. A generic class of DAGs with bounded (but unspecified) indegree and outdegree will simply be denoted by #. The class of all data structures defined over the label universe domain Y and with skeleton in #^(i,c) will be denoted as Y^#^(i,c). The void DAG will be denoted by the special symbol ξ.
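Since every algorithm below processes a DAG bottom-up, an inverse topological order (children before parents) is the key primitive. As a small illustrative sketch (my own, not from the paper; the adjacency-list representation and all names are assumptions), such an order can be computed as follows:

```python
def inverse_topological_order(children):
    """children: node id -> ordered list of child ids (a DOAG/DPAG given
    as adjacency lists). Returns the nodes in an order where every node
    appears after all of its children, i.e., the bottom-up evaluation order."""
    indegree = {v: 0 for v in children}
    for v in children:
        for u in children[v]:
            indegree[u] += 1
    # a supersource is a node that no edge points to
    sources = [v for v, d in indegree.items() if d == 0]
    order, seen = [], set()

    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for u in children[v]:
            visit(u)
        order.append(v)  # post-order: all children are already listed

    for s in sources:
        visit(s)
    return order
```

With a single supersource s, as required above, the returned list ends with s.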
3 Recursive Neural Networks
Recursive Neural Networks are neural networks able to perform mappings from a set of labeled graphs to the set of real vectors. Specifically, the class of functions which can be realized by a recursive neural network can be characterized as the class of functional graph transductions T: I^# → IR^k, where I = IR^m, which can be represented in the following form: T = g ∘ b, where b: I^# → IR^n is the encoding (or state transition) function and g: IR^n → IR^k is the output function. Specifically, given a DOAG D, b is defined recursively as
    b(D) = 0 (the null vector in IR^n)           if D = ξ
    b(D) = τ(y_s, b(D^(1)), …, b(D^(c)))         otherwise        (1)

¹ If no supersource is present, a new node connected with all the nodes of the graph with null indegree can be added.
where τ is defined as τ: IR^m × IR^n × ⋯ × IR^n → IR^n (with c copies of IR^n), where IR^m denotes the label space, while the remaining domains represent the encoded subgraph spaces up to the maximum outdegree of the input domain I^#; c is the maximum outdegree of DOAGs in I^#, s = source(D), y_s is the label attached to the supersource of D, and D^(1), …, D^(c) are the subgraphs pointed by s. A typical neural realization for τ is

    τ(u_v, x^(1), …, x^(c)) = F(B u_v + Σ_{j=1}^{c} A_j x^(j) + θ),        (2)
where F_i(v) = f(v_i) (a sigmoidal function), u_v ∈ IR^m is a label, θ ∈ IR^n is the bias vector, B ∈ IR^{n×m} is the weight matrix associated with the label space, x^(j) ∈ IR^n are the vectorial codes obtained by the application of the encoding function to the subgraphs D^(j) (i.e., x^(j) = b(D^(j))), and A_j ∈ IR^{n×n} is the weight matrix associated with the j-th subgraph space. Concerning the output function g, it can be defined as a map g: IR^n → IR^k and, in general, it is realized by a feed-forward network. The encoding process of an input graph can be represented graphically by unfolding equation (1) through the input graph, and using equation (2), obtaining in this way the so-called encoding network. An example of an encoding network obtained by using two recursive neurons and a single output neuron is shown in Figure 1. The output of the encoding network depends on the values of the weights, and it will be used as a numerical vectorial code to represent the input graph. This code will then be further processed by the parametric function g() (in the example, the single output neuron) to obtain the desired regression (or classification) value for the specific input graph. Given a cost function E and a training set T = {(D_i, t_i)}_{i=1,…,L}, where with each data structure D_i a desired target value t_i is associated, using eq. (1), eq. (2), and the neural network implementing the output function g, for each D_i a feed-forward network can be generated and trained to match the corresponding desired target value t_i. Since the weights are shared among all the generated feed-forward networks, the training converges to a set of weight values which reproduces the desired target value for each data structure in the training set. For each D_l, its vertices are enumerated according to a chosen inverse topological order as ṽ_1, ṽ_2, …, ṽ_{P_l}. Moreover, let v̂_1, v̂_2, …, v̂_h be the set of vertices, belonging to any DAG in the training set, for which a target value is defined.
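To make the recursion in eqs. (1)-(2) concrete, here is a minimal sketch (my own illustration, not code from the paper; tanh stands in for the sigmoidal F, and all names are hypothetical). Shared subgraphs are memoized so each state is computed only once:

```python
import numpy as np

class Node:
    def __init__(self, label, children=()):
        self.label = np.asarray(label, dtype=float)  # y_v in IR^m
        self.children = list(children)               # ordered, at most c

def encode(node, B, A, theta, memo=None):
    """b(D) of eq. (1): recursively encode a DOAG into IR^n.
    B: n x m label weights, A: list of c matrices (n x n), theta: bias in IR^n."""
    n = B.shape[0]
    if node is None:                     # void graph -> null vector in IR^n
        return np.zeros(n)
    if memo is None:
        memo = {}
    if id(node) in memo:                 # a shared subgraph is encoded once
        return memo[id(node)]
    c = len(A)
    kids = node.children + [None] * (c - len(node.children))
    s = B @ node.label + theta           # eq. (2): B u_v + theta
    for j, child in enumerate(kids):
        s += A[j] @ encode(child, B, A, theta, memo)  # + A_j x^(j)
    x = np.tanh(s)                       # F applied componentwise
    memo[id(node)] = x
    return x
```

The output function g (e.g., a small feed-forward network) is then applied to encode(root, ...) to produce the desired regression or classification value.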
For simplicity, here we assume that only the supersources have a target defined (thus h = L). U = [U_1, …, U_L] ∈ IR^{(m+1)×P} collects all the labels of the data structures (including the bias components, always set to 1), where P = Σ_{l=1}^{L} P_l. Similarly, for each graph D_l and each k ∈ [1,…,c], the matrices X_l^(k) ∈ IR^{n×P_l}, defined as X_l^(k) = [x_{l,ch_k[ṽ_1]}, …, x_{l,ch_k[ṽ_{P_l}]}], where x_{l,ch_k[ṽ_i]} = x_0 if ch_k[ṽ_i] = ∅, collect the status information for each pointer k. All the information concerning a pointer k is stored in the matrix X^(k) = [X_1^(k), …, X_L^(k)] ∈ IR^{n×P}. Moreover, we define the matrix collecting the information needed to compute the output
Fig. 1. Example of DOAG with corresponding encoding network.
associated to the training graphs as X_target = [x_{v̂_1}, x_{v̂_2}, …, x_{v̂_L}]. Let us define δ_{ilv} = ∂E_l/∂x_{ilv}. For node v of a given DOAG D_l, the corresponding delta errors δ_{ilv} for the state variables can be collected in the vector δ_{lv} = [δ_{1lv}, …, δ_{nlv}]^t, and the contributions from all the graph's nodes can be collected in the matrix Δ_l = [δ_{l,1}, …, δ_{l,P_l}] ∈ IR^{n×P_l}, where the order of the columns follows the inverse topological order chosen for the graph. Finally, Δ = [Δ_1, …, Δ_L] ∈ IR^{n×P} contains the delta errors for all the graphs of the learning environment, whereas the delta error corresponding to the output unit is denoted by δ^{out}_{ls} = ∂E_l/∂net_{ls} and collected into the matrix Δ̂. The gradient of the cost can be calculated by using Backpropagation in each encoding network, that is, by propagating the error through the given structure, similarly to what happens in recurrent networks when processing sequences. The gradient of the cost can be written in a compact form by using the vectorial notation G_B = [∂E/∂b_{ij}] ∈ IR^{(m+1)×n} and G_{A_k} = [∂E/∂a_{ijk}] ∈ IR^{n×n}. Based on these definitions, G_B and G_{A_k} can be computed as follows:

    G_B = Σ_{l=1}^{L} G_{B,l} = Σ_{l=1}^{L} Σ_{v∈vert(D_l)} u_v δ_{lv}^t = U Δ^t,
    G_{A_k} = Σ_{l=1}^{L} G_{A_k,l} = Σ_{l=1}^{L} Σ_{v∈vert(D_l)} x_{l,ch_k[v]} δ_{lv}^t = X^(k) Δ^t,        (3)
where G_{B,l} and G_{A_k,l} are the gradient contributions corresponding to E_l, that is, to DOAG D_l. Specifically, let Q_k(v) = {u | ch_k[u] = v}. The delta error δ_{ilv} can be computed recursively according to the following equation:

    δ_{ilv} = f'(net_{ilv}) Σ_{k=1}^{c} Σ_{j=1}^{n} a^k_{ji} Σ_{z∈Q_k(v)} δ_{jlz}        (4)

where if Q_k(v) = ∅ then Σ_{z∈Q_k(v)} δ_{jlz} = 0. The above equation can be rewritten in compact form as

    δ_{lv} = J_{lv} ( Σ_{k=1}^{c} A_k^t Σ_{z∈Q_k(v)} δ_{lz} ),        (5)

where J_{lv} is a diagonal matrix with elements [J_{lv}]_{ii} = f'(net_{ilv}). Moreover, by applying equation (5) recursively, we obtain

    δ_{lv} = ( Σ_{p∈Paths_l(s,v)} Π_{(u,ch_k[u])∈p} J_{l,ch_k[u]} A_k^t ) δ_{ls},        (6)

where Paths_l(s,v) is the set of paths in D_l from the supersource s to node v, and the product is left-hand, starting from the supersource s and ending at node v. This equation gives rise to the Back-Propagation Through Structure (BPTS) gradient computational scheme [9,15]. Following the same approach, it is not difficult to generalize any supervised learning algorithm, such as RTRL, to the treatment of structured data. Constructive algorithms, such as Recurrent Cascade-Correlation, can be generalized as well [15]. A common feature of all the supervised models is that, assuming stationarity, causality, and discrete labels, the training set can be optimized so as to reduce the computational complexity of training. The basic idea is to represent each subgraph in the training set only once: if two graphs share the same subgraph, this subgraph needs to be represented only once, and the state associated with it needs to be computed only once. Applying this basic idea, the training set can be collapsed into a single minimal DOAG (see Figure 2 for an example) in time O(P log P).
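The collapsing of shared subgraphs can be sketched with a hash-based canonicalization (my own illustration, not the paper's procedure; it assumes discrete, hashable labels, and with hashing the expected cost is even linear in the number of nodes rather than the O(P log P) of comparison-based techniques). A node here is a hypothetical (label, children) pair:

```python
def collapse(roots):
    """Collapse a forest of labeled ordered trees into a minimal DOAG:
    isomorphic subgraphs get the same integer id and are stored only once.
    A node is a pair (label, children_tuple) with a discrete label."""
    table = {}  # canonical key -> integer id, one entry per distinct subgraph

    def visit(node):
        label, children = node
        key = (label, tuple(visit(ch) for ch in children))
        if key not in table:
            table[key] = len(table)
        return table[key]

    return [visit(r) for r in roots], table
```

During training, the state attached to each distinct id is then computed once and reused by every occurrence of that subgraph.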
3.1 Unsupervised Learning: Self-Organizing Maps for Structured Data
A Self-Organizing Map model for processing structured data has been recently proposed [11]. This model hinges on the recursive encoding idea described in eq. (1). The aim of the SOM learning algorithm is to learn a feature map M: I → A which, given a vector in the spatially continuous input space I, returns a point in the spatially discrete output display space A. In fact, SOM is performing
Fig. 2. Optimization of the training set.
data reduction via a vector quantization approach. When the input space is a structured domain with labels in Y, i.e., I = Y^#^(i,c), the map M̂: Y^#^(i,c) → A can be realized by the following recursive definition:

    M̂(D) = nil_A                                     if D = ξ
    M̂(D) = M°(y_s, M̂(D^(1)), …, M̂(D^(c)))           otherwise
where nil_A is a special coordinate (the void coordinate) in the discrete output space A, and M°: Y × A^c → A is a SOM, defined on a generic node, which takes in input the label of the node and the "encoding" of the subgraphs D^(1), …, D^(c) according to the M̂ map. By "unfolding" the recursive definition, it turns out that M̂(D) can be computed by starting to apply M° to leaf nodes (i.e., nodes with null outdegree), and proceeding with the application of M° bottom-up from the frontier to the supersource of the graph D. Assuming that each label in Y is encoded in IR^m, for each v ∈ vert(D) we have a vector u_v of dimension m. Moreover, the display output space A, where A = [1..n_1] × [1..n_2] × ⋯ × [1..n_q], is realized through a q-dimensional lattice of neurons, so the winning neuron is represented by the coordinates (i_1, …, i_q). With the above assumptions, we have M°: IR^m × ([1..n_1] × ⋯ × [1..n_q])^c → [1..n_1] × ⋯ × [1..n_q], and the (m + cq)-dimensional input vector x_v to M°, representing the information about a generic node v, is defined as x_v = [u_v, D_{ch_1[v]}, D_{ch_2[v]}, …, D_{ch_c[v]}], where D_{ch_i[v]} is the coordinate vector of the winning neuron for the subgraph pointed by the i-th pointer of v.² Each neuron with coordinate vector c_j in the q-dimensional lattice has an associated weight vector w_{c_j} ∈ IR^{m+cq}. The weights associated with each neuron in the q-dimensional lattice M° can be trained using the standard SOM learning process, while the training algorithm for M̂ is shown in Figure 3. The coordinates for the (sub)graphs are stored in D_v, once for each processing of graph D, and then used when needed³ for the training of M°.
² The null pointer nil_A can be defined, for example, by the vector with all components at -1.
³ Notice that the use of an inverted topological order guarantees that the updating of the coordinate vectors D_v is done before the use of D_v for training.
Unsupervised Stochastic Training Algorithm for M̂
input: {D_i}_{i=1,…,N}, M°;
begin
    randomly set the weights for M°;
    repeat
        randomly select D ∈ T with uniform distribution;
        List(D) ← an inverted topological order for vert(D);
        for v ← first(List(D)) to last(List(D)) do
            train(M°, [u_v, D_{ch_1[v]}, D_{ch_2[v]}, …, D_{ch_c[v]}]);
            D_v ← M°([u_v, D_{ch_1[v]}, D_{ch_2[v]}, …, D_{ch_c[v]}]);
end

Fig. 3. The unsupervised SOM for structured data.
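A minimal sketch of one step of this training loop (my own illustration under simplifying assumptions: a Gaussian neighborhood on a flat lattice, a fixed learning rate, and hypothetical names throughout):

```python
import numpy as np

def encode_node(u_v, child_coords, q, nil=-1.0):
    """Build x_v = [u_v, D_ch1[v], ..., D_chc[v]]; the nil coordinate
    (all components at -1) stands for a missing child."""
    parts = [np.asarray(u_v, dtype=float)]
    for cc in child_coords:
        parts.append(np.full(q, nil) if cc is None else np.asarray(cc, dtype=float))
    return np.concatenate(parts)

def somsd_step(W, grid, x, lr=0.1, sigma=1.0):
    """One SOM-SD update. W: (units, m + c*q) weight vectors,
    grid: (units, q) lattice coordinates, x: input vector for node v.
    Returns the winner's index and its lattice coordinates D_v."""
    win = int(np.argmin(((W - x) ** 2).sum(axis=1)))   # closest unit
    d2 = ((grid - grid[win]) ** 2).sum(axis=1)         # lattice distances
    h = np.exp(-d2 / (2.0 * sigma ** 2))               # neighborhood function
    W += lr * h[:, None] * (x - W)                     # move neighbors toward x
    return win, grid[win]
```

Processing a graph bottom-up, the returned winner coordinates of each node feed into the inputs of its parents, exactly as in the recursive definition above.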
Of course, the stored vector is an approximation of the true coordinate vector for the graph rooted in v; however, if the learning rate is small, this approximation may be negligible. The call train() refers to a single step of standard SOM training. Preliminary experimental results showed that the model was actually able to cluster similar (sub)structures and to organize the input (sub)graphs according to both the information encoded in the labels and the information encoded in the structures.

3.2 Computational, Complexity, and Learnability Issues

The computational power of Recursive Neural Networks has been studied in [14] by using hard threshold units and frontier-to-root tree automata. In [3], several strategies for encoding finite-state tree automata in high-order and first-order sigmoidal recursive neural networks have been proposed. Complexity results on the amount of resources needed to implement frontier-to-root tree automata in recursive neural networks are presented in [10]. Results on function approximation, and a theoretical analysis of learnability and generalization of recursive neural networks (referred to as folding networks), can be found in [12,13]. Finally, an analysis of sufficient conditions that guarantee the absence of local minima when training recursive neural networks can be found in [6].
4 Conclusion
Neural networks can deal with structured data. Here I have discussed some of the basic concepts underpinning Recursive Neural Networks. Applications of Recursive Neural Networks are starting to emerge. They include learning search-control heuristics for automated deduction systems [8], logo recognition [5], chemical applications (QSPR/QSAR) [1,2], and incremental parsing of natural language [4]. More work is needed to progress in this field. For example, no
accepted proposal for expanding the processing of neural networks to cyclic graphs has yet been established.
References
1. A.M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Application of cascade correlation networks for structures to chemistry. Applied Intelligence, 12:115-145, 2000.
2. A.M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Analysis of the internal representations developed by neural networks for structures applied to quantitative structure-activity relationship studies of benzodiazepines. J. Chem. Inf. Comput. Sci., 41(1):202-218, 2001.
3. R.C. Carrasco and M.L. Forcada. Simple strategies to encode tree automata in sigmoid recursive neural networks. IEEE TKDE, 13(2):148-156, 2001.
4. F. Costa, P. Frasconi, V. Lombardo, and G. Soda. Towards incremental parsing of natural language using recursive neural networks. Applied Intelligence, to appear.
5. E. Francesconi, P. Frasconi, M. Gori, S. Marinai, J.Q. Sheng, G. Soda, and A. Sperduti. Logo recognition by recursive neural networks. In R. Kasturi and K. Tombre, editors, Graphics Recognition: Algorithms and Systems, pages 104-117. LNCS, Vol. 1389, Springer Verlag, 1997.
6. P. Frasconi, M. Gori, and A. Sperduti. On the efficient classification of data structures by neural networks. In IJCAI, pages 1066-1071, 1997.
7. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE TNN, 9(5):768, 1998.
8. C. Goller. A Connectionist Approach for Learning Search-Control Heuristics for Automated Deduction Systems. PhD thesis, Technical University Munich, Computer Science, 1997.
9. C. Goller and A. Küchler. Learning task-dependent distributed structure-representations by backpropagation through structure. In IEEE International Conference on Neural Networks, pages 347-352, 1996.
10. M. Gori, A. Küchler, and A. Sperduti. On the implementation of frontier-to-root tree automata in recursive neural networks. IEEE Transactions on Neural Networks, 10(6):1305-1314, 1999.
11. M. Hagenbuchner, A. Sperduti, and A.C. Tsoi. A supervised self-organising map for structured data. In WSOM'01, 2001.
12. B. Hammer. On the learnability of recursive data. Mathematics of Control Signals and Systems, 12:62-79, 1999.
13. B. Hammer. Generalization ability of folding networks. IEEE TKDE, 13(2):196-206, 2001.
14. A. Sperduti. On the computational power of recurrent neural networks for structures. Neural Networks, 10(3):395-400, 1997.
15. A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3):714-735, 1997.
Bad Design and Good Performance: Strategies of the Visual System for Enhanced Scene Analysis Florentin W¨ org¨ otter University of Stirling, Department of Psychology, Stirling FK9 4LA, Scotland http://www.cn.stir.ac.uk
Abstract. The visual system of vertebrates is a highly efficient, dynamic scene-analysis machine, even though many aspects of its own design are, at first glance, rather inconvenient from the viewpoint of a neural network or computer vision engineer. For several of these apparently imperfect design principles, however, it seems that the system is able to turn things around and instead make a virtue of them.
I will concentrate on three examples where the design and the actual performance of the visual system seem to mismatch. In particular, I will try to show how visual perception in creatures and computers can actually be improved when treating such "faulty" signals in the "right way". Starting with older studies about visual signal transmission delays and their possible use in image segmentation, I will then present novel ideas about the positive effects of noise in visual signals. Finally, I will present data about changing the spatiotemporal resolution of cortical cell responses. For the first time, this was measured in an open-loop paradigm, achieved by specifically eliminating the corticothalamic feedback loop from within an otherwise intact cortex. 1) The visual world at the level of a single cortical cell is anything but constant. Receptive fields of cortical cells very often encounter new situations due to fast (saccadic) eye movements, which occur at a rate of up to 5 Hz, and/or due to object motion in the viewed scene. This process is further complicated by the fact that the visual activity reaches higher visual areas only after a certain delay, the visual latency, which is heavily contrast dependent: activity from dark objects arrives much later than that from bright objects. Interestingly, we normally do not see a delay between the perception of bright versus dark objects. In a series of older studies, we had suggested that, instead of interfering with perception, visual latency might actually be used to segment images into their dark and bright parts and thereby help in the process of object recognition. The naturally arising delay between bright-elicited and dark-elicited activity is enough to drive such a process. In fact, image segmentation can be accelerated and improved rather dramatically in technical systems when latencies are combined with a spin-relaxation model for image segmentation.
2) Latency differences may play a role especially after a saccadic eye movement, when the eye stabilizes on a new image. However, even during fixation or
smooth pursuit, retinal positioning errors and the effects of eye-tremor induce target shifts that lead to a rapid change in cortical cell stimulation. In spite of this, we do not notice motion blurring. This effect is very hard to correct in technical systems (cameras) when they have to operate under such adverse conditions. Recording from the visual cortex, we recently found that (simulated) eye-tremor superimposed on a moving stimulus drives cortical cells harder than a smoothly moving stimulus. We attribute this effect to stochastic resonance, and it may well be that it leads to a contrast enhancement at edges. Thus, tremor seems to enhance cortical signal amplitudes. In addition to this, we found that tremor can, quite paradoxically, also substantially increase visual resolution. Hyperacuity describes the fact that our visual resolution is better than predicted from the distance between two photoreceptors. One expects that motion noise should lower visual resolution. Interestingly, we have shown in a model that hyperacuity can actually be assisted by eye-tremor. This is based on the fact that many photoreceptors are randomly moved across the stimulus. The temporal integration properties of the retina and the divergence/convergence structure of the primary visual pathway are also instrumental in this process. Technical systems (camera chips) can in a similar way benefit from tremor, as has been pointed out by Mitros and his coworkers from CALTECH (in press). 3) All these effects arise mainly from the processing properties of our "visual front-end", the eye or the retinal network, respectively. However, along the ascending pathways, additional sources, which shape and modify the signals, come into play. At the level of the retina, visual signals (especially from X-cells) still bear a high degree of linearity, but more and more nonlinear disturbances occur higher up in the visual hierarchy.
This is mostly due to the recursive action of feedback loops which interfere with ascending signals. This happens for the first time in the visual thalamus through the action of the corticothalamic feedback. Early models already suggested that such nonlinear disturbances could actually enhance the signal analysis properties of a system. A classic example is the proposal that a shift of the spotlight of visual attention could be introduced by this (or another) feedback loop. Later on, data became available which showed that the corticothalamic feedback loop also changes the spatiotemporal resolution of thalamic cell responses in a nonlinear way. In a model, we predicted that this feedback loop is also actively involved in locally enhancing the visual resolution in the cortex. All this has been known or suggested for some time. But how can one measure the effects of a closed loop in the cortex itself? From an engineering perspective, it would be ideal to compare the normal (closed-loop) situation with the so-called "open-loop" condition. This has so far been impossible to investigate in the corticothalamic system, as it would require disentangling cells and fibers and eliminating specifically those from which the feedback arises. By means of a novel, complicated, 3-step experimental protocol we are now able to do this for the first time. These experiments indeed support the prediction that the closed-loop corticothalamic system enhances cortical cell responses without losing the spatial precision of their receptive fields.
Enhanced Scene Analysis
The results presented here are mainly intended to provide a proof of concept for the underlying ideas. Obviously, it is very hard to find unequivocal experimental support for them. Studies of the action of the corticothalamic feedback loop performed by many groups are probably gradually reaching a state where, despite a lack of many details, the conclusions start to converge. In addition, these studies show that it is sometimes possible to trace seemingly unaddressable model predictions by designing dedicated experiments. Thus, even adopting the pessimist's view, it does not seem entirely hopeless to use such proof-of-concept models as a step in trying to understand brain function. Those who feel that this statement is still too frustrating may find consolation in the fact that ideas such as those put forward here have already often been successfully implemented in technical systems.
Acknowledgements. The support of the DFG and the HFSP is gratefully acknowledged.
Data Analysis and Pattern Recognition
Fast Curvature Matrix-Vector Products
Nicol N. Schraudolph
Institute of Computational Sciences, Eidgenössische Technische Hochschule, CH-8092 Zürich, Switzerland
[email protected]
Abstract. The Gauss-Newton approximation of the Hessian guarantees positive semidefiniteness while retaining more second-order information than the Fisher information. We extend it from nonlinear least squares to all differentiable objectives such that positive semidefiniteness is maintained for the standard loss functions in neural network regression and classification. We give efficient algorithms for computing the product of extended Gauss-Newton and Fisher information matrices with arbitrary vectors, using techniques similar to but even cheaper than the fast Hessian-vector product [1]. The stability of SMD [2,3,4,5], a learning rate adaptation method that uses curvature matrix-vector products, improves when the extended Gauss-Newton matrix is substituted for the Hessian.
1 Definitions and Notation
Network. A neural network with m inputs, n weights, and o linear outputs is usually regarded as a mapping R^m → R^o from an input pattern x to the corresponding output y, for a given vector w of weights. Here we formalize such a network instead as a mapping N : R^n → R^o from weights to outputs (for given inputs), and write y = N(w). To extend this formalism to networks with nonlinear outputs, we define the output nonlinearity M : R^o → R^o and write z = M(y) = M(N(w)). For networks with linear outputs, M is the identity.

Loss Function. We consider neural network learning as the minimization of a scalar loss function L : R^o → R defined as the log-likelihood L(z) ≡ − log Pr(z) of the network output z under a suitable statistical model [6]. For supervised learning, L may also implicitly depend on given targets z* for the network outputs. Formally, the loss can now be regarded as a function L(M(N(w))) of the weights, for a given set of inputs and (if supervised) targets.

Jacobian. The Jacobian J_F of a function F : R^m → R^n is the n×m matrix of partial derivatives of the outputs of F with respect to its inputs. For a neural network defined as above, the gradient of the loss with respect to the weights is given by

  ∂/∂w L(M(N(w))) = J^⊤_{L∘M∘N} = J^⊤_N J^⊤_M J^⊤_L ,   (1)

where ∘ denotes function composition, and ^⊤ the matrix transpose. We use J as an abbreviation for J_{L∘M∘N}.
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 19–26, 2001. © Springer-Verlag Berlin Heidelberg 2001
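The chain rule in (1) can be checked numerically. A minimal sketch (our own illustration, not from the paper), using a one-output linear network N, identity M, and sum-squared loss L against a target z* = 1:

```python
import numpy as np

# Sketch (our own illustration): checking the chain rule (1) on a tiny
# network N with 2 weights, one linear output, identity M, and
# sum-squared loss against a target z* = 1.
x = np.array([0.7, -0.3])              # fixed input pattern
w = np.array([0.5, 0.2])               # weight vector, n = 2

N = lambda w_: np.array([w_ @ x])      # N : R^n -> R^o, o = 1
M = lambda y_: y_                      # identity output nonlinearity
L = lambda z_: 0.5 * (z_ - 1.0) ** 2   # sum-squared loss
loss = lambda w_: float(L(M(N(w_))).sum())

# chain rule: gradient = J_N^T J_M^T J_L^T; here J_M = I, J_L = z - z*, J_N = x
z = M(N(w))
grad = x * (z - 1.0)

# finite-difference check of the gradient
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
               for e in np.eye(2)])
assert np.allclose(grad, fd, atol=1e-5)
```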
Matching Loss Functions. We say that the loss function L matches the output nonlinearity M iff J^⊤_{L∘M} = Az + b, for some A and b not dependent on w. The standard loss functions used in neural network regression and classification — sum-squared error for linear outputs, and cross-entropy error for softmax or logistic outputs — are all matching loss functions with A = I and b = −z*, so that J^⊤_{L∘M} = z − z* [6, chapter 6]. This will simplify some of the calculations described in Section 3 below.

Hessian. The instantaneous Hessian H_F of a scalar function F : R^n → R is the n×n matrix of second derivatives of F(w) with respect to its inputs w:

  H_F ≡ ∂J^⊤_F/∂w , i.e., (H_F)_{ij} = ∂²F(w) / (∂w_i ∂w_j) .   (2)
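The matching-loss property J^⊤_{L∘M} = z − z* is easy to verify numerically. A small sketch (our own check) for the softmax/cross-entropy case named in the text:

```python
import numpy as np

# Sketch (our own check) of the matching-loss property: for softmax
# outputs with cross-entropy loss, J_{L∘M}^T = z - z* (A = I, b = -z*).
def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

y = np.array([0.2, -1.0, 0.5])        # linear network outputs
z_star = np.array([0.0, 1.0, 0.0])    # one-hot target
z = softmax(y)
loss = lambda y_: -float(np.dot(z_star, np.log(softmax(y_))))

# finite-difference gradient of L∘M with respect to y
eps = 1e-6
g = np.array([(loss(y + eps * e) - loss(y - eps * e)) / (2 * eps)
              for e in np.eye(3)])
assert np.allclose(g, z - z_star, atol=1e-5)
```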
For a neural network as defined above, we abbreviate H ≡ H_{L∘M∘N}. The Hessian proper, which we denote H̄, is obtained by taking the expectation of H over inputs: H̄ ≡ ⟨H⟩_x. For matching loss functions, H_{L∘M} = A J_M = J^⊤_M A^⊤.

Fisher Information. The instantaneous Fisher information matrix F_F of a scalar log-likelihood function F : R^n → R is the n×n matrix formed by the outer product of its first derivatives:

  F_F ≡ J^⊤_F J_F , i.e., (F_F)_{ij} = ∂F(w)/∂w_i · ∂F(w)/∂w_j .   (3)

Note that F_F always has rank one. As before, we abbreviate F ≡ F_{L∘M∘N}. The Fisher information matrix proper, F̄ ≡ ⟨F⟩_x, describes the geometric structure of weight space [7] and is used in the natural gradient descent approach [8].
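A short numerical illustration (ours) of the rank-one structure of the instantaneous Fisher matrix, and of why Fv costs only two O(n) operations:

```python
import numpy as np

# Sketch (ours): the instantaneous Fisher matrix F = J^T J is a rank-one
# outer product (eq. 3), so F v needs only two O(n) operations.
J = np.array([[0.3, -1.2, 0.7]])      # 1 x n gradient of the scalar loss
F = J.T @ J                           # n x n, rank one
v = np.array([1.0, 2.0, -0.5])

Fv = J.T.flatten() * float(J @ v)     # multiply gradient by the scalar Jv
assert np.linalg.matrix_rank(F) == 1
assert np.allclose(F @ v, Fv)
```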
2 Extended Gauss-Newton Approximation
Problems with the Hessian. The use of the Hessian in second-order gradient descent for neural networks is problematic: for nonlinear systems, H̄ is not necessarily positive definite, so Newton's method may diverge, or even take steps in uphill directions. Practical second-order gradient methods should therefore use approximations or modifications of the Hessian that are known to be reasonably well-behaved, with positive semidefiniteness as a minimum requirement.

Fisher Information. One alternative that has been proposed is the Fisher information matrix F̄ [8], which — being a quadratic form — is positive semidefinite by definition. On the other hand, F̄ ignores all second-order interactions, thus throwing away a lot of potentially useful information. By contrast, we shall derive an approximation of the Hessian that is positive semidefinite even though it does retain certain second-order terms.

Gauss-Newton. An entire class of popular optimization techniques for nonlinear least squares problems — as implemented by neural networks with linear
outputs and sum-squared loss function — is based on the well-known Gauss-Newton (aka "linearized", "outer product", or "squared Jacobian") approximation of the Hessian. Here we extend the Gauss-Newton approach to other standard loss functions — in particular, the cross-entropy loss used in neural network classification — in such a way that even though some second-order information is retained, positive semidefiniteness can still be proven. Using the product rule, the instantaneous Hessian of our neural network can be written as

  H = ∂/∂w (J_{L∘M} J_N)^⊤ = J^⊤_N H_{L∘M} J_N + Σ_{i=1}^{o} (J_{L∘M})_i H_{N_i} ,   (4)

where i ranges over the o outputs of N (the network proper), with N_i denoting the subnetwork that produces the i-th output. Ignoring the second term above, we define the extended, instantaneous Gauss-Newton matrix

  G ≡ J^⊤_N H_{L∘M} J_N .   (5)
Note that G has rank ≤ o (the number of network outputs), and is positive semidefinite, regardless of the choice of network N, provided that H_{L∘M} is. G models the second-order interactions among the network outputs (via H_{L∘M}) while ignoring those arising within the network itself (H_{N_i}). This constitutes a compromise between the Hessian (which models all second-order interactions) and the Fisher information (which ignores them all). For systems with a single, linear output and sum-squared error, G reduces to F; in all other cases it provides a richer source of curvature information.

Standard Loss Functions. For the standard loss functions used in neural network regression and classification, G has additional interesting properties: Firstly, the residual J^⊤_{L∘M} = z − z* vanishes at the optimum for realizable problems, so that the Gauss-Newton approximation (5) of the Hessian (4) becomes exact in this case. For unrealizable problems, the residuals at the optimum have zero mean; this will tend to make the last term in (4) vanish in expectation, so that we can still assume Ḡ ≈ H̄ near the optimum.

Secondly, in each case we can show that H_{L∘M} (and hence G, and hence Ḡ) is positive semidefinite: for linear outputs with sum-squared loss — i.e., conventional Gauss-Newton — H_{L∘M} = J_M is just the identity I; for independent logistic outputs with cross-entropy loss it is diag[z·(1 − z)], positive semidefinite because (∀i) 0 < z_i < 1. For softmax output with cross-entropy loss we have H_{L∘M} = diag(z) − zz^⊤, which is also positive semidefinite since (∀i) z_i > 0 and Σ_i z_i = 1, and thus (∀v ∈ R^o)

  v^⊤ [diag(z) − zz^⊤] v = Σ_i z_i v_i² − (Σ_i z_i v_i)²
                         = Σ_i z_i v_i² − 2 (Σ_i z_i v_i)(Σ_j z_j v_j) + (Σ_j z_j v_j)²
                         = Σ_i z_i (v_i − Σ_j z_j v_j)² ≥ 0 .   (6)
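The positive-semidefiniteness argument above can also be confirmed numerically. A short sketch (our own check) for the softmax/cross-entropy case, including the rank bound on G:

```python
import numpy as np

# Sketch (our own numerical check): for a softmax output z, the matrix
# H_{L∘M} = diag(z) - z z^T of eq. (6) is positive semidefinite, and so is
# G = J_N^T H_{L∘M} J_N for an arbitrary (here random) Jacobian J_N.
rng = np.random.default_rng(0)
z = np.exp(rng.normal(size=4))
z /= z.sum()                                 # a softmax output, o = 4

H_lm = np.diag(z) - np.outer(z, z)
assert np.all(np.linalg.eigvalsh(H_lm) >= -1e-10)

J_N = rng.normal(size=(4, 10))               # o x n, n = 10 weights
G = J_N.T @ H_lm @ J_N
assert np.all(np.linalg.eigvalsh(G) >= -1e-8)
assert np.linalg.matrix_rank(G) <= 4         # rank of G is at most o
```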
3 Fast Curvature Matrix-Vector Products
3.1 The Passes
We now describe algorithms that compute the product of F, G, or H with an arbitrary n-dimensional vector v in O(n). They are all constructed from the same set of passes in which certain quantities are propagated through the network in either forward or reverse direction. For implementation purposes it should be noted that automatic differentiation software tools¹ can automatically produce these passes from a program implementing the basic forward pass f0.
f0. This is the ordinary forward pass of a neural network, evaluating the function F(w) it implements by propagating activity forward through F.
r1. The ordinary backward pass of a neural network, calculating J^⊤_F u by propagating u backwards through F. Uses intermediate results from the f0 pass.
f1. Following Pearlmutter [1], we define the Gateaux derivative

  R_v(F(w)) ≡ ∂/∂r F(w + rv) |_{r=0} = J_F v ,   (7)
which describes the effect on a function F(w) of a weight perturbation in the direction of v. By pushing R_v — which obeys the usual rules for differential operators — down into the equations of the forward pass f0, one obtains an efficient procedure which calculates J_F v from v; see [1] for details and examples. This f1 pass uses intermediate results from the f0 pass.
r2. When the R_v operator is applied to the r1 pass for a scalar function F, one obtains an efficient procedure for calculating the Hessian-vector product H_F v = R_v(J^⊤_F). Again, see [1] for details and examples. This r2 pass uses intermediate results from the f0, f1, and r1 passes.
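When implementing these passes, the product they deliver can be validated against finite differences, since H_F v is also the directional derivative of the gradient. A tiny sketch (ours, on a toy quadratic with known Hessian):

```python
import numpy as np

# Sketch (ours): the r2 pass delivers the exact product H_F v; an
# implementation can be validated against central differences of the
# gradient: H_F v ≈ (grad F(w + eps v) - grad F(w - eps v)) / (2 eps).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # known Hessian of F(w) = 0.5 w^T A w
grad = lambda w_: A @ w_             # gradient of F
w = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])

eps = 1e-6
Hv_fd = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)
assert np.allclose(Hv_fd, A @ v, atol=1e-5)
```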
3.2 The Algorithms
The first step in all three matrix-vector products is the computation of the gradient J^⊤ of our neural network model by standard backpropagation:

Gradient. J is computed by an f0 pass through the entire network (N, M, and L), followed by an r1 pass propagating u = 1 back through the entire network (L, M, then N). For matching loss functions there is a shortcut: since J^⊤_{L∘M} = Az + b, we can limit the forward pass to N and M (to compute z), then r1-propagate u = Az + b back through just N.

Fisher Information. To compute Fv = J^⊤ Jv, simply multiply the gradient J^⊤ by the inner product between J and v. If there is no random access to J or v — i.e., if their elements can be accessed only through passes like the above — the scalar Jv can instead be calculated by f1-propagating v forward through the network. This step is also necessary for the other two matrix-vector products.

¹ See http://www-unix.mcs.anl.gov/autodiff/

Hessian. After f1-propagating v forward, r2-propagate R_v(1) = 0 back through the entire network to obtain Hv = R_v(J^⊤) [1]. For matching loss functions, the shortcut is to f1-propagate v through just N and M to obtain R_v(z), then r2-propagate R_v(J^⊤_{L∘M}) = A R_v(z) back through N.

Gauss-Newton. Following the f1 pass, r2-propagate R_v(1) = 0 back through L and M to obtain R_v(J^⊤_{L∘M}) = H_{L∘M} J_N v, then r1-propagate that back through N, giving Gv. For matching loss functions we do not require an r2 pass: since

  G = J^⊤_N H_{L∘M} J_N = J^⊤_N J^⊤_M A^⊤ J_N ,   (8)

we can limit the f1 pass to N, multiply the result with A^⊤, then r1-propagate it back through M and N. Alternatively, one may compute the equivalent Gv = J^⊤_N A J_M J_N v by continuing the f1 pass through M, multiplying with A, and r1-propagating back through N.

Batch Average. To calculate the product of a curvature matrix C̄ ≡ ⟨C⟩_x — where C is one of F, G, or H — with vector v, average the instantaneous product Cv over all input patterns x (and associated targets z*, if applicable) while holding v constant. For large training sets, or nonstationary streams of data, it is often preferable to estimate C̄v by averaging over "minibatches" of (typically) just 5–50 patterns.
3.3 Computational Cost
Table 1 summarizes, for various gradient methods, the corresponding curvature matrix C, the passes needed (for a matching loss function) to calculate both the gradient J̄ ≡ ⟨J⟩_x and the fast matrix-vector product C̄v, and the associated computational cost in terms of floating-point operations (flops) per weight and pattern in a multilayer perceptron. These figures ignore certain optimizations — e.g., not propagating gradients back to the inputs — and assume that any computation at the network's nodes is dwarfed by that required for the weights.

Table 1. Passes needed to compute gradient J̄ and fast matrix-vector product C̄v, and associated cost (for a multilayer perceptron) in flops per weight and pattern, for various choices of curvature matrix C.

  Method           | C | f0 (L, 2 flops) | r1 (J^⊤u, 3) | f1 (Jv, 4) | r2 (Hv, 7) | Cost (J̄ & C̄v)
  steepest descent | I | ✓               | ✓            |            |            |  6
  natural gradient | F | ✓               | ✓            | ✓          |            | 10
  Gauss-Newton     | G | ✓               | ✓✓           | ✓          |            | 14
  Newton's method  | H | ✓               | ✓            | ✓          | ✓          | 18
4 Application to Stochastic Meta-Descent (SMD)
Algorithm. SMD [2,3,4,5] is a new, highly effective online algorithm for local learning rate adaptation. It updates the weights w by the simple gradient descent

  w_{t+1} = w_t − p_t · J ,   (9)

where · denotes element-wise multiplication, and J the stochastic gradient. The vector p of local learning rates is adapted multiplicatively:

  p_t = p_{t−1} · max(½, 1 + µ v_t · J) ,   (10)

using a scalar meta-learning rate µ. This update minimizes the network's loss with respect to p by exponentiated gradient descent [9], but has been relinearized so as to avoid the computationally expensive exponentiation operation [10]. The auxiliary vector v used in (10) is itself updated iteratively via

  v_{t+1} = λ v_t + p_t · (J − λ C v_t) ,   (11)

where C is the curvature matrix, and 0 ≤ λ ≤ 1 a forgetting factor for nonstationary tasks. Cv_t is computed via the fast algorithms described above.

Benchmark Setup. We illustrate the behavior of SMD on the "four regions" benchmark [11]: a fully connected feedforward network N with two hidden layers of 10 tanh units each (Fig. 1, right) is to classify two continuous inputs in the range [−1, 1] into four disjoint, non-convex regions (Fig. 1, left). We use the standard softmax output nonlinearity M with matching cross-entropy loss L, meta-learning rate µ = 0.05, initial learning rates p_0 = 0.1, and uniformly random initial weights in the range [−0.3, 0.3]. Training patterns are generated online by drawing independent, uniformly random input samples; they are presented in minibatches of 10 patterns each. Since each pattern is seen only once, the empirical loss provides an unbiased estimate of generalization ability.
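A minimal sketch (our own toy setup, not the benchmark above) of updates (9)-(11) on a two-dimensional quadratic loss, using the exact curvature matrix for C; all names and constants here are ours:

```python
import numpy as np

# Sketch (ours): SMD updates (9)-(11) on a toy 2-D quadratic loss
# 0.5 (w - w_opt)^T A (w - w_opt), with the exact curvature C = A.
A = np.array([[3.0, 0.4],
              [0.4, 1.0]])
w_opt = np.array([1.0, -2.0])
w = np.zeros(2)
p = np.full(2, 0.1)        # local learning rates p_0
v = np.zeros(2)            # auxiliary vector of eq. (11)
mu, lam = 0.05, 1.0        # meta-learning rate mu, forgetting factor lambda

for _ in range(200):
    J = A @ (w - w_opt)                          # gradient (here exact, not stochastic)
    p = p * np.maximum(0.5, 1 + mu * v * J)      # eq. (10), elementwise
    v = lam * v + p * (J - lam * (A @ v))        # eq. (11), Cv computed exactly
    w = w - p * J                                # eq. (9)

assert np.linalg.norm(w - w_opt) < 1e-3
```

The curvature-vector product `A @ v` here stands in for the fast O(n) procedures of Section 3.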
Fig. 1. The four regions task (left), and the network we trained on it (right).
Fig. 2. Loss curves for 25 runs of SMD with λ = 1, when using the Hessian (left), the Fisher information (center), or the extended GaussNewton matrix (right) for C in Equation (11). Vertical spikes indicate divergence.
Curvature Matrix. Fig. 2 shows loss curves for SMD with λ = 1 on the four regions problem, starting from 25 different random initial states, using the Hessian, Fisher information, and extended Gauss-Newton matrix, respectively, for C in Equation (11). With the Hessian (left), 80% of the runs diverge — most of them early on, when the risk that H is not positive definite is greatest. When we guarantee positive semidefiniteness by switching to the Fisher information matrix (center), the proportion of diverged runs drops to 20%; those runs that still diverge do so only relatively late. Finally, for our extended Gauss-Newton approximation (right) only a single run diverges, illustrating the benefit of retaining certain second-order terms while preserving positive semidefiniteness.

Stability. The residual tendency of SMD to occasionally diverge can be suppressed further by slightly lowering the λ parameter. By curtailing the memory of iteration (11), however, this can compromise the rapid convergence of SMD, resulting in a stability/performance tradeoff (Fig. 3).
Fig. 3. Average loss over 25 runs of SMD for various combinations of curvature matrix C and forgetting factor λ. Memory (λ → 1) is key to rapid convergence.
With the extended Gauss-Newton approximation, a small reduction of λ to 0.998 (solid line) is sufficient to prevent divergence, at a moderate cost in performance relative to λ = 1 (dashed). When the Hessian is used, by contrast, λ must be set as low as 0.95 to maintain stability, and convergence is slowed much further (dash-dotted). Even so, this is still significantly faster than the degenerate case of λ = 0 (dotted), which in effect implements IDD [12], to our knowledge the best competing online method for local learning rate adaptation. From these experiments it appears that memory (i.e., λ close to 1) is key to achieving the rapid convergence characteristic of SMD. We are now investigating other, more direct ways to keep iteration (11) under control, aiming to ensure the stability of SMD while maintaining its excellent performance at λ = 1.

Acknowledgment. We would like to thank Jenny Orr and Barak Pearlmutter for many helpful discussions, and the Swiss National Science Foundation for the financial support provided under grant number 2000–052678.97/1.
References
1. B. A. Pearlmutter, "Fast exact multiplication by the Hessian," Neural Computation, vol. 6, no. 1, pp. 147–160, 1994.
2. N. N. Schraudolph, "Local gain adaptation in stochastic gradient descent," in Proc. 9th Int. Conf. Artificial Neural Networks, pp. 569–574, IEE, London, 1999.
3. N. N. Schraudolph, "Online learning with adaptive local step sizes," in Neural Nets – WIRN Vietri-99: Proc. 11th Italian Workshop on Neural Networks (M. Marinaro and R. Tagliaferri, eds.), Perspectives in Neural Computing, (Vietri sul Mare, Salerno, Italy), pp. 151–156, Springer Verlag, Berlin, 1999.
4. N. N. Schraudolph, "Fast second-order gradient descent via O(n) curvature matrix-vector products," Tech. Rep. IDSIA-12-00, IDSIA, Galleria 2, CH-6928 Manno, Switzerland, 2000. Submitted to Neural Computation.
5. N. N. Schraudolph and X. Giannakopoulos, "Online independent component analysis with local learning rate adaptation," in Adv. Neural Info. Proc. Systems (S. A. Solla, T. K. Leen, and K.-R. Müller, eds.), vol. 12, pp. 789–795, The MIT Press, Cambridge, MA, 2000.
6. C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon, 1995.
7. S.-i. Amari, Differential-Geometrical Methods in Statistics, vol. 28 of Lecture Notes in Statistics. New York: Springer Verlag, 1985.
8. S.-i. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
9. J. Kivinen and M. K. Warmuth, "Additive versus exponentiated gradient updates for linear prediction," in Proc. 27th Annual ACM Symp. Theory of Computing, (New York, NY), pp. 209–218, Association for Computing Machinery, 1995.
10. N. N. Schraudolph, "A fast, compact approximation of the exponential function," Neural Computation, vol. 11, no. 4, pp. 853–862, 1999.
11. S. Singhal and L. Wu, "Training multilayer perceptrons with the extended Kalman filter," in Adv. Neural Info. Proc. Systems: Proc. 1988 Conf. (D. S. Touretzky, ed.), pp. 133–140, Morgan Kaufmann, 1989.
12. M. E. Harmon and L. C. Baird III, "Multi-player residual advantage learning with general function approximation," Tech. Rep. WL-TR-1065, Wright Laboratory, WL/AACF, 2241 Avionics Circle, Wright-Patterson AFB, OH 45433-7308, 1996.
Architecture Selection in NLDA Networks
José R. Dorronsoro, Ana M. González, and Carlos Santa Cruz
Dept. of Computer Engineering and Instituto de Ingeniería del Conocimiento, Universidad Autónoma de Madrid, 28049 Madrid, Spain
Abstract. In Non Linear Discriminant Analysis (NLDA) an MLP-like architecture is used to minimize a Fisher discriminant analysis criterion function. In this work we study the architecture selection problem for NLDA networks. We derive asymptotic distribution results for NLDA weights, from which Wald-like tests can be constructed. We also discuss how to use them to make decisions on unit relevance based on the acceptance or rejection of a certain null hypothesis.
1 Introduction
Non Linear Discriminant Analysis (NLDA) is a novel feature extraction procedure proposed by the authors in which a Multilayer Perceptron (MLP) architecture is used but where optimal weights are selected by minimizing a Fisher discriminant criterion function. More precisely, in this work we shall consider two-class problems and the simplest possible such architecture, having D input units, a single hidden layer with H units and a single, linear output unit (C − 1 outputs are used in Fisher's linear discriminants for a C-class problem). We denote D-dimensional network inputs as X = (x_1, …, x_D)^t (A^t denotes the transpose of A), the weights connecting the D inputs with the hidden unit h as W_h^H = (w_{1h}^H, …, w_{Dh}^H)^t, and the weights connecting the hidden layer to the single output by W^O. We assume that x_D ≡ 1 for bias effects. The network transfer function F(X, W) is thus given by y = F(X, W) = (W^O)^t O, with O = O(X, W) = (o_1, …, o_H)^t the outputs of the hidden layer, that is, o_h = f((W_h^H)^t · X), and f denoting the sigmoidal function. In other words, y = F(X, W) performs a standard MLP transformation. The difference lies in the selection of the optimal weights W_* = (W_*^O, W_*^H), for which the criterion function J(W) = s_T / s_B is to be minimized. Here s_T = E[(y − ȳ)²] is the total covariance of the output y and s_B = π_1 (ȳ_1 − ȳ)² + π_2 (ȳ_2 − ȳ)² the between-class covariance. By ȳ = E[y] we denote the overall output mean, ȳ_c = E[y|c] the output class conditional means, and π_c the class prior probabilities. NLDA networks allow a general and flexible use of any other of the criterion functions customarily employed in linear Fisher analysis. In [1,7] other properties of NLDA networks are studied, and it is shown that the features they provide can be more robust than those obtained through MLPs in problems where there is a large class overlap and where class sample sizes are markedly unequal.
With partial support from Spain’s CICyT, grant TIC 98–247.
G. Dorﬀner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 27–32, 2001. c SpringerVerlag Berlin Heidelberg 2001
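A small sketch (our own toy illustration) of estimating the NLDA criterion J(W) = s_T/s_B from two-class samples of the scalar network output y:

```python
import numpy as np

# Sketch (our own toy illustration): estimating the NLDA criterion
# J(W) = s_T / s_B from two-class samples of the scalar network output y.
rng = np.random.default_rng(3)
y1 = rng.normal(loc=2.0, size=500)    # outputs y for class 1
y2 = rng.normal(loc=-1.0, size=500)   # outputs y for class 2
y = np.concatenate([y1, y2])
pi1 = pi2 = 0.5                       # class priors (equal here)

s_T = np.var(y)                       # total output covariance E[(y - ybar)^2]
y_bar = y.mean()
s_B = pi1 * (y1.mean() - y_bar) ** 2 + pi2 * (y2.mean() - y_bar) ** 2
J = s_T / s_B                         # the criterion to be minimized

# s_T decomposes as s_B plus a within-class term, so J >= 1;
# smaller J means better class separation along y
assert J > 1.0
```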
As with MLPs, optimal NLDA network architectures have to be somehow decided upon. We shall analyze here how this can be done. In our setting we have to decide on the number of hidden units. If the relevance of the original features is also to be decided, the number of input units cannot be taken as fixed in advance and may have to be lowered. Architecture selection is a long-studied question for MLPs, and many approaches, such as bootstrapping, pruning, regularization, stopped training or information criteria, have been proposed (see [3,6,8] for an overview). A common starting point for several of these methods can be found in the asymptotic theory of network weights. We shall derive asymptotic distribution results for the weights w_{dh}^H connecting the input and hidden layers, from which we will obtain a statistic for weight relevance. For the linear weights w_h^O between the hidden units and the output we shall adapt some statistics used in classical Fisher analysis to determine the significance of discrimination features. These results are given in the next section, while the third one contains an illustration of these procedures on a synthetic problem.
2 Unit Relevance in NLDA Networks
We shall study unit relevance, working with sets of weights related to a concrete unit. To do so we could consider either input or output weights. However, as happens with MLPs, NLDA networks may present identifiability problems. For instance, if all the weights w_{kl} starting at a given hidden unit k are zero, the weights w_{hk} leading to that unit can take any value without affecting network outputs. To avoid these effects we shall consider what we could term "output" unit relevance, that is, the joint relevance of the weights leaving a unit. For a two-class NLDA network and a D × H × 1 architecture, the relevance of the hidden unit h is thus given in terms of the weight w_h^O, 1 ≤ h ≤ H, while that of an input unit d, 1 ≤ d ≤ D, is jointly determined by the weight set w_{dh}^H, 1 ≤ h ≤ H. Notice that the non-relevance of an input unit has to be interpreted as a lack of relevance of the associated feature, which can thus be deleted without any loss of discriminating information. Each relevance has to be dealt with separately, and we begin with the input units, for which we shall derive a Wald-like test. This requires obtaining the asymptotic distribution of these weights. The starting point is the following result [4].

Theorem 1. Let Ψ(X, W) be a vector-valued function depending on a random variable X and a parameter vector W such that ∇_W Ψ(X, W) and ∇²_W Ψ(X, W) exist for all X and are bounded by integrable functions. Assume that E[Ψ(X, W_*)] = 0 at an isolated point W_* and that the matrices

  H_* = H(W_*) = E[∇_W Ψ(X, W_*)] ,  I_* = I(W_*) = E[Ψ(X, W_*) Ψ(X, W_*)^t]

are respectively positive definite and non-singular. Then, if X_n is a sequence of i.i.d. random vectors distributed as X, the equation (1/N) Σ_{n=1}^N Ψ(X_n, W) = 0 has a solution sequence Ŵ_N converging in probability to W_* and such that √N (Ŵ_N − W_*) converges in distribution to a multivariate normal with 0 mean and variance C_* = C(W_*) = (H_*)^{−1} I_* (H_*)^{−1}.

If S is now a 0–1 selection vector with a given number K of 1's, then, under the null hypothesis H_0 : SW_* = 0, it follows that √N S Ŵ_N → N(0, S^t C_* S) in distribution. Therefore, the random variable ζ_N = √N (S^t C_* S)^{−1/2} S Ŵ_N converges to a K-dimensional N(0, I), and

  ||ζ_N||² = N (S Ŵ_N)^t (S^t C_* S)^{−1} (S Ŵ_N)

converges in distribution to a chi-square χ²_K with K degrees of freedom.

We shall apply this result to NLDA network architecture selection. Optimal weights are obtained by iterating a two-step procedure. Assuming that we have the values (W_t^O, W_t^H) at step t, we first obtain W_{t+1}^O by applying standard linear Fisher analysis to the outputs O = O(X, W_t^H) of the hidden layer. Next, with W_{t+1}^O fixed, W_{t+1}^H is obtained by numerically minimizing J as a function J(W^H) = J(W_{t+1}^O, W^H) of the W^H. We first derive asymptotic results for the hidden weights W^H, obtaining Ψ from the gradient of the network criterion function J with respect to these weights, which we compute next. Since the output weights W^O are computed as in Fisher's linear discriminant analysis, they are invariant with respect to translations in the last hidden layer. In particular, we can center the last hidden layer outputs around zero without affecting the W^O or the value of J. We will thus assume that at the last hidden layer Ō = 0, which implies that the network outputs y will have zero mean as well, for y = (W^O)^t O. Under these simplifications, we therefore have s_T = E[y²] and s_B = π_1 (ȳ_1)² + π_2 (ȳ_2)². It now follows that

  ∂s_T/∂w_{kl} = 2 E[y ∂y/∂w_{kl}] = 2 w_l E[y f′(a_l) x_k] ;
  ∂s_B/∂w_{kl} = 2 Σ_c π_c ȳ_c E_c[∂y/∂w_{kl}] = 2 w_l Σ_c π_c ȳ_c E_c[f′(a_l) x_k] ,

where we recall that E_c[z] = E[z|c] denotes class conditional expectations.
Therefore, considering now the W^O fixed and therefore viewing J as a function J(W^H) of only the W^H weights, it can be shown that

∂J/∂w_kl = (2 w_l / s_B) { E[y f′(a_l) x_k] − J Σ_c π_c ȳ^c E_c[f′(a_l) x_k] }.   (1)

Let us now define Ψ = (Ψ_kl) as Ψ_kl(X, W) = w_l y f′(a_l) x_k − λ_kl(W) = z_kl(X, W) − λ_kl(W), where the non-random term λ_kl(W) is given by λ_kl(W) = w_l J(W) Σ_c π_c ȳ^c E_c[f′(a_l) x_k]. At an isolated minimum W∗ of J, (1) shows that

E[Ψ_kl(X, W∗)] = (s∗_B / 2) ∂J/∂w_kl (W∗) = 0,

with s∗_B = s_B(W∗); in particular, E[z_kl(·, W∗)] = λ_kl(W∗). Moreover, at W∗,

E[∂Ψ_kl/∂w_mn (X, W∗)] = ∂/∂w_mn ( (s_B/2) ∂J/∂w_kl ) (W∗) = (s∗_B/2) ∂²J/(∂w_mn ∂w_kl) (W∗),
30
Jos´e R. Dorronsoro, Ana M. Gonz´ alez, and Carlos Santa Cruz
and, hence, H∗ = E[∇_W Ψ(X, W∗)] is positive definite. The second partials at a minimum W∗ of J can be computed using (1) and their value is

∂²J/(∂w_mn ∂w_kl)(W∗) = (2 w_l / s∗_B) { E[f′(a_n) x_m f′(a_l) x_k] + δ_ln E[y f″(a_l) x_m x_k] }
− (2 w_l / s∗_B) J∗ Σ_c π_c ( E_c[f′(a_n) x_m] E_c[f′(a_l) x_k] + δ_ln ȳ^c E_c[f″(a_l) x_m x_k] ).

Here J∗ = J(W∗) and δ_ln denotes Kronecker's δ. We also have

(I∗)_(kl)(mn) = E[Ψ_kl(X, W∗) Ψ_mn(X, W∗)] = E[z_kl(X, W∗) z_mn(X, W∗)] − λ_kl(W∗) λ_mn(W∗).

Therefore (I∗)_(kl)(mn) = cov(z_kl, z_mn), and if it is non-singular, it follows that for N random variables X₁, …, X_N, i.i.d. as X, there is a solution sequence Ŵ_N of

(1/N) Σ_{j=1}^N ( z_kl(X^j, W) − λ_kl(W) ) = (1/N) Σ_{j=1}^N ( w_l y^j f′(a_l^j) x_k^j − λ_kl(W) ) = 0,   (2)
that converges in probability to W∗ and such that √N(Ŵ_N − W∗) converges in distribution to a multivariate normal N(0, C∗), where C∗ = (H∗)^{-1} I∗ (H∗)^{-1}. We define the relevance R_d of the d-th input unit by choosing now a selection vector S = (s_(pq)), 1 ≤ p ≤ D, 1 ≤ q ≤ H, where the s_(dq) = 1 and all the others are zero, and setting

R_d = N (S Ŵ_N)^t (S^t Ĉ S)^{-1} (S Ŵ_N) = N Σ_h (ŵ_dh)² / ( (Ĥ^{-1} Î Ĥ^{-1})_(dh)(dh) ).   (3)
We can thus decide to keep or remove the d-th feature by comparing the value of R_d against the chosen confidence level of a χ²_H distribution with H degrees of freedom.

Gradient and Hessian computations can also be derived for the linear output weights W^O. In fact, since s_T = (W^O)^t S_T^O W^O and s_B = (W^O)^t S_B^O W^O, considering now J as a function J(W^O), it is easy to see that ∇_{W^O} J = 2(S_T^O − J S_B^O) W^O / s_B, and it also follows that ∇²_{W^O} J(W∗) = 2(S_T^O − J∗ S_B^O)/s∗_B, with again s∗_B = s_B(W∗). However, it is clear that ∇²_{W^O} J(W∗) is singular, for the optimal W∗^O satisfies (S_T^O − J∗ S_B^O) W∗^O = 0, and the preceding arguments leading to asymptotic weight distributions will not hold for the W^O. As an alternative we can consider the input relevance tests used in classical Fisher analysis. Let W^O = (S_W^O)^{-1}(Ō¹ − Ō²) be the optimal Fisher weights, D² = (Ō¹ − Ō²)^t (S_W^O)^{-1} (Ō¹ − Ō²), and let T = (S_T^O)^{-1} be the inverse of the total covariance matrix of the hidden layer outputs. Then it can be shown for two-class problems [5] that if the inputs come from two d-dimensional multinormal distributions with a common covariance, the statistic

R_h = (N − d − 1) c² w_h² / ( (N − 2) T_hh (N − 2 + c² D²) ),   (4)
with c² = N₁N₂/N, N_i the number of sample patterns in class i and N = N₁ + N₂, follows an F_{1,N−2} distribution under the null hypothesis w_h = 0. In our context, the values of the last hidden outputs would be taken as the inputs of a classical Fisher transformation. Of course, the preceding common-multinormal assumption may not hold for these hidden layer outputs and, therefore, tests for significance at a given level may not be strictly applicable. However, a value of the statistic that would warrant rejection of the null hypothesis can be taken as an indicator of that weight's relevance, while a value compatible with accepting the null hypothesis suggests that the weight should be removed.
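A sketch of how the statistic of Eq. (4) might be computed is given below, using synthetic two-class "hidden layer outputs" in place of those of a trained NLDA network. Sample covariances stand in for S_W^O and S_T^O (the normalization conventions may differ from those of [5]), so only relative magnitudes should be read into the result.

```python
import numpy as np

# Sketch of the statistic R_h of Eq. (4), computed from synthetic two-class
# "hidden layer outputs" rather than from a trained NLDA network.

def hidden_relevance(O1, O2):
    """Return the statistic R_h for every hidden unit h."""
    N1, N2 = len(O1), len(O2)
    N, d = N1 + N2, O1.shape[1]
    c2 = N1 * N2 / N
    m1, m2 = O1.mean(axis=0), O2.mean(axis=0)
    # Pooled within-class covariance, standing in for S_W.
    Sw = (np.cov(O1.T, bias=True) * N1 + np.cov(O2.T, bias=True) * N2) / (N - 2)
    w = np.linalg.solve(Sw, m1 - m2)        # optimal Fisher weights
    D2 = (m1 - m2) @ w                      # squared Mahalanobis distance
    T = np.linalg.inv(np.cov(np.vstack([O1, O2]).T, bias=True))
    return ((N - d - 1) * c2 * w ** 2) / ((N - 2) * np.diag(T) * (N - 2 + c2 * D2))

rng = np.random.default_rng(0)
# Unit 0 separates the two classes; unit 1 is pure noise.
O1 = np.column_stack([rng.normal(2.0, 1.0, 200), rng.normal(0.0, 1.0, 200)])
O2 = np.column_stack([rng.normal(-2.0, 1.0, 200), rng.normal(0.0, 1.0, 200)])
print(hidden_relevance(O1, O2))  # unit 0 scores far higher than unit 1
```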
3 A Numerical Illustration
We will illustrate the preceding techniques on a toy synthetic problem with two unidimensional classes. The first one, C₀, follows a N(0, 0.5) distribution and the other one, C₁, is given by a mixture of two Gaussians, N(−2, 0.25) and N(2, 0.25). The prior probabilities of the two classes are 0.5. As a classification problem it is an "easy" one, for the mean error probability (MEP) of the optimal Bayes classifier has a rather low value of about 0.48%. On the other hand, the class distributions are neither unimodal nor linearly separable. NLDA networks will thus perform feature enhancement rather than feature extraction. We shall work with the sample versions Î, Ĥ of I, H, and we denote by ŵ_dh the sample-derived weights.

We start with a minimal network with all units being relevant and shall "grow" the network's hidden layers. That is, we will add new units in a step-by-step fashion, stopping if the last added unit fails the corresponding test. Once we arrive at the optimal number of hidden units, we shall prune input units, also in a step-by-step fashion, each time removing the feature found to be least relevant. It is easily seen that the optimal architecture for this problem needs just 2 hidden units: one hidden unit is not enough for either MLPs or NLDA networks (although it could be if, for instance, f(x) = x² were used instead of the sigmoidal), while with 2 units the MEP of the NLDA classifier coincides with the optimal Bayes value of 0.48%.

To illustrate input feature selection we shall consider 3-dimensional features, where we add two noisy inputs. The first one is given by a uniform distribution on [−12, −8], and the second by a N(10, 1) normal. As just explained, we start with a minimal 3 × 1 × 1 network and grow it using the statistic R_h to detect when an irrelevant hidden unit has been added. The first half of Table 1 shows the values of R_h for networks with 2 and 3 hidden units, together with the probabilities of obtaining the statistic's value assuming the null hypothesis to be true. It is clear from the table that all hidden units are relevant in networks with 2 such units. A third unit is, however, not relevant: the corresponding value of the statistic R_h is 0.63 and the probability of getting that value or higher under H₀ is about 0.85. This shows the optimal number of hidden units to be 2. Once we have arrived at the
Table 1. Typical hidden and input unit relevance values and probabilities of getting such a value or higher under the null hypothesis.

No. Hidden U. | Hidden unit relevance R_h      | H0 Prob.
2             | 9.14·10⁴   9.07·10⁴            | 0   0
3             | 7.41·10⁴   8.90·10⁴   0.63     | 0   0   0.85

No. Inputs    | Input unit relevance R_d       | H0 Prob.
3             | 5.84·10³   10.2   4.20         | 0   0.06   0.13
2             | 6.01·10³   1.07                | 0   0.58
3 × 2 × 1 architecture optimal at this point, we will study input feature relevance, removing step by step those features found not to be relevant. This is done using the statistic (3), which now approximately follows a χ² distribution with 2 degrees of freedom. The second half of Table 1 shows the values of R_d for networks with 3 and 2 input units and the corresponding probability levels under the null hypothesis. The 3-input row shows that the null hypothesis can be very safely rejected for the first input, while it would be rejected at the 6% level for the second input and only at a relatively high 13% level for the third input. Observe also that the first input's relevance is about 1000 times larger than those of the other inputs. If the third feature (Gaussian inputs) is rejected and its input unit removed, the 2-input row shows that H₀ for the second feature (uniform inputs) could only be rejected at a clearly much too high 58% level, while the null hypothesis can be rejected for the first input: under H₀, the probability of getting such a value of the statistic is again 0. This yields the final optimal 1 × 2 × 1 architecture.
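The grow-then-prune procedure just described can be written schematically as follows. Here `train_nlda`, `hidden_relevance` and `input_relevance` are hypothetical placeholders for the actual fitting routine and the statistics R_h (Eq. 4) and R_d (Eq. 3); the critical values are the usual 95% quantiles, with the F(1, N−2) threshold approximated by 3.9 for large N.

```python
# Schematic architecture-selection loop: grow the hidden layer until the
# last added unit fails its relevance test, then prune irrelevant inputs.

CHI2_CRIT = {1: 3.84, 2: 5.99, 3: 7.81}   # 95% chi-square critical values

def select_architecture(data, n_inputs, train_nlda, hidden_relevance,
                        input_relevance, f_crit=3.9, max_hidden=10):
    # 1. Grow the hidden layer step by step.
    H = 1
    net = train_nlda(data, n_inputs, H)
    while H < max_hidden:
        candidate = train_nlda(data, n_inputs, H + 1)
        if hidden_relevance(candidate, H + 1) < f_crit:   # new unit irrelevant
            break
        net, H = candidate, H + 1
    # 2. Prune inputs: repeatedly drop the least relevant feature while it
    #    fails the chi-square test with H degrees of freedom.
    inputs = list(range(n_inputs))
    while len(inputs) > 1:
        rel = {d: input_relevance(net, d) for d in inputs}
        worst = min(rel, key=rel.get)
        if rel[worst] >= CHI2_CRIT[H]:
            break
        inputs.remove(worst)
        net = train_nlda(data, len(inputs), H)
    return net, H, inputs
```

With relevance values like those of Table 1, this loop stops at 2 hidden units and keeps only the original discriminating feature.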
References

1. J. Dorronsoro, F. Ginel, C. Sánchez, C. Santa Cruz, "Neural Fraud Detection in Credit Card Operations", IEEE Transactions on Neural Networks 8 (1997), 827–834.
2. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1972.
3. R. Golden, Mathematical Models for Neural Network Analysis and Design, MIT Press, 1996.
4. E.B. Manoukian, Modern Concepts and Theorems of Mathematical Statistics, Springer, 1986.
5. K.V. Mardia, J.T. Kent, J.M. Bibby, Multivariate Analysis, Academic Press, 1979.
6. B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
7. C. Santa Cruz, J.R. Dorronsoro, "A nonlinear discriminant algorithm for feature extraction and data classification", IEEE Transactions on Neural Networks 9 (1998), 1370–1376.
8. H. White, "Learning in artificial networks: a statistical perspective", Neural Computation 1 (1989), 425–464.
Neural Learning Invariant to Network Size Changes

Vicente Ruiz de Angulo and Carme Torras

Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Parc Tecnològic de Barcelona, Edifici U, Llorens i Artigas 4-6, 08028 Barcelona, Spain
[email protected],
[email protected]

Abstract. This paper investigates the functional invariance of neural network learning methods. By functional invariance we mean the property of producing functionally equivalent minima as the size of the network grows, when the smoothing parameters are fixed. We study three different principles on which functional invariance can be based, and try to delimit the conditions under which each of them acts. We find that, surprisingly, some of the most popular neural learning methods, such as weight-decay and input noise addition, exhibit this interesting property.
1 Introduction

This work stems from an observation we made when analyzing the behaviour of a deterministic algorithm that emulates neural learning with random weights. We found that, for a fixed variance greater than zero, there is a number of hidden units above which the learned function does not change, or the change is slight and tends to zero as the size of the network grows [7]. Here we study the conditions a neural learning algorithm should satisfy in order to lead to the same function, irrespective of network size.

Methods for complexity reduction [1] usually include one parameter (and sometimes more than one) to regulate the simplicity or smoothness imposed on the function implemented by the network. Each method simplifies the network in a way that is supposed to be optimal for the class of functions being approximated. Thus, ideally, the optimal level of smoothing should be obtained only by manipulating the above-mentioned parameter. Variability caused by other sources must be considered spurious uncertainty. For example, functionally different minima obtained by the same algorithm when running on different architectures or when departing from different initial points are embarrassing for the practitioner, who would like to be freed from having to optimize the algorithm along these lines as well. In particular, the selection of the number of hidden units of the architecture can decisively influence the result and is computationally cumbersome. This motivates the interest in complexity minimization methods that depend only on the explicit complexity parameter and not on the size of the chosen architecture. However, there are no claims of functional invariance for the known methods, although Neal [4] devised a prior such that the complete Bayesian procedure using it can be considered functionally invariant (see Section 3.1). In what
G. Dorﬀner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 33–40, 2001. c SpringerVerlag Berlin Heidelberg 2001
follows we put forth some theoretical arguments and present some experimental results indicating that functional invariance may be a rather common phenomenon, even in well-known methods long used by the connectionist community. We also try to delimit the conditions that a complexity reduction method must satisfy in order to yield functional invariance. The paper focuses on regularization methods for complexity reduction [1] and those that can be made equivalent to them. Regularization consists in adding to the error function a penalty term that regulates the complexity of the implemented network via a multiplicative factor called the regularizer coefficient.
2 The Phenomenon: Learned-Function Invariance

We shall first define clearly the phenomenon under study, namely learned-function invariance. Let F(X, W) and G(X, W′) denote the input-output functions implemented by two feedforward networks having an equal number of input and output units, but a different number of hidden units, and with weight vectors W and W′, respectively. The functional distance between F(•, W) and G(•, W′) is defined as

dist(F(•, W), G(•, W′)) = lim_{s→∞} (1/Vol(Ω(s))) ∫_{Ω(s)} (F(X, W) − G(X, W′))² dX,

Ω(s) being a cube of side s in the input space. Now, let M(F, λ) be the optimum weight vector obtained with network F by a learning method M involving some complexity reduction regulated by the parameter λ. The name "method" is used here to denote an idealized algorithm, usually characterized by the minimization of an objective function, that always finds global optima. If M is a regularization method then

M(F, λ) = argmin_W C(W) = argmin_W E(W) + λ R(W),
where E (W ) is the standard error function and R (W ) > 0 is the regularization term. Finally, let { Fn } be a family of one hiddenlayer architectures differing only in the number n of hidden units. We say that the algorithm M yields functional invariance for the network family { Fn } if dist ( Fi (•, M (Fi , λ )), Fi+1 (•, M (Fi+1, λ ))) tends to zero when i tends to infinity for every λ > 0 . It is necessary to make some remarks about these definitions. First, we only consider global minima of C (W ) in the definition of M (F , λ ) , and not local minima, saddle points or other points resulting from a numerical optimization of C (W ) . Second, all the global minima of C (W ) must be functionally equivalent, or the distance dist ( Fi (•, M (Fi , λ )), Fi+1 (•, M (Fi+1, λ ))) would not be well defined. Obviously it is impossible to fulfill this condition for λ = 0 , but not for λ > 0 . This is related to the explicit exclusion of local minima from the functional invariance definition, since global and local minima, having different values of E , produce different outputs for the training patterns, which implies that the two minima cannot be functionally equivalent. It is possible to extend this definition using families of
architectures that are not limited to one hidden layer of units. The only condition is that the elements of the family can be indexed in such a way that, given an arbitrary precision, for any continuous mapping (from the input to the output space) there exists an index value such that architectures with higher indices can approximate the mapping with that precision. The last remark is that the definitions are independent of the training set, the only implicit requirement being that any non-empty training set brings the functional distance limit to zero.
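The functional distance above can be approximated by Monte-Carlo integration over the cube Ω(s). The two tanh networks below are arbitrary illustrations: the second duplicates one hidden unit of the first and splits its output weight in half, a transformation that leaves the implemented function unchanged, so their distance is (numerically) zero.

```python
import numpy as np

# Monte-Carlo estimate of dist(F, G): average squared output difference
# over points drawn uniformly from the cube Omega(s).

def mlp(X, W1, b1, W2):
    return np.tanh(X @ W1 + b1) @ W2

def functional_distance(f, g, dim, s, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-s / 2, s / 2, size=(n_samples, dim))
    return np.mean((f(X) - g(X)) ** 2)

rng = np.random.default_rng(1)
W1, b1, W2 = rng.normal(size=(2, 4)), rng.normal(size=4), rng.normal(size=4)
f = lambda X: mlp(X, W1, b1, W2)

# g: copy of f with hidden unit 0 duplicated and its output weight split.
W1g = np.concatenate([W1, W1[:, :1]], axis=1)
b1g = np.concatenate([b1, b1[:1]])
W2g = np.concatenate([W2 * np.array([0.5, 1.0, 1.0, 1.0]), [0.5 * W2[0]]])
g = lambda X: mlp(X, W1g, b1g, W2g)

print(functional_distance(f, g, dim=2, s=10))  # ~0: functionally equivalent
```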
Fig. 1. The results of training a 4-HU network (a) and an 8-HU network (b). Each block in the figure contains all the weights associated with a hidden unit. Weights are represented by squares, white if their value is positive and black if it is negative. The size of the square is proportional to the magnitude of the weight. The top weight in the block is the weight from the hidden unit to the output unit; the weights in the bottom row are the weights incoming to the hidden unit.
Figure 1 shows two networks trained with the same 20 training points, randomly drawn from the function 0.3x³ + 0.3x² + 10/3(x+3)² in the interval [−1.5, 1.5], using a deterministic algorithm that emulates learning with random weights [5, 6]: the mean of the weight distribution is adapted to minimize the average error over the distribution. The complexity of the function implemented by that mean is regulated via the variance of the weight distribution. Both the four-hidden-unit (HU) network and the eight-HU network were trained using the same variance. It can be seen that the distance between the two resulting networks is null, i.e., they are functionally equivalent. Clearly, the weights of the first unit in Figure 1(a) are the same as those of the fifth unit in Figure 1(b). Moreover, the fourth unit in Figure 1(a) shows a direct correspondence with the first unit in Figure 1(b): the weights have pairwise the same magnitude, and since the signs of both the input weights and the output weights are inverted, the two units have the same functionality. It can be concluded that the two networks implement the same approximation of the desired function. Applying the algorithm to any network with a large number of hidden units, we obtain the same units in different positions, with combinations of sign inversions that always produce the same input-output function. Testing the algorithm with other variances produces other configurations that clearly converge to some function in the limit of infinite HUs. However, the closer the variance is to zero, the more difficult the optimization becomes, as the configurations found become more and more complex, and convergence is attained much more slowly as the number of HUs grows. This is a feature common to all the algorithms that we have explored: it is hard to check functional invariance when the complexity reduction constraints are loose.
3 Conditions for the Appearance of Invariance

Initially, when we found functional invariance while experimenting with the random-weights learning algorithm, we thought it was a rather unique phenomenon, in the sense that it was particular to the kind of weight configurations our algorithm created or, at least, that any algorithm exhibiting functional invariance should produce weights sharing the same essential properties. However, deeper reflection revealed that functional invariance can appear when using different algorithms, and for very diverse reasons. Up to now, we have identified three types of algorithms, corresponding to three principles on which functional invariance can be based.

3.1 Neal's Type Priors

Regularization methods can be considered under a Bayesian perspective by viewing E(W) and λR(W) as the negative log probability (disregarding some constants) of the output distribution of the function being approximated and of the weight distribution, respectively. Then, C(W) can be shown to be equivalent to the negative log of the posterior weight probability, and its minimization corresponds to a "maximum a posteriori" procedure. However, the use of the same prior on two different architectures does not imply a direct relationship between the functions they implement. In fact, a prior over the weights of different architectures can induce different priors over the output functions. Neal [4] devised a prior over the weights that, although also inducing a different prior over functions for each network, converges to a unique one as the number of hidden units tends to infinity. Convergence of the input-output function posterior implies convergence of its mode. Thus, a procedure optimizing this posterior may be used to obtain functional invariance. Despite this, a prior over functions in the infinite-HU limit is not enough to directly imply the functional invariance of the minimization of C(W).
In fact, this minimization finds the most probable weight vector, which does not correspond to the most probable input-output function, because there is a Jacobian determinant factor mediating the two probability densities [8], and the minimization of C(W) optimizes the posterior of the weights, not that of the input-output functions. The true Bayesian procedure, however, does not consist simply in finding the mode of the weight posterior, as carried out by the usual regularization method. Instead, it takes into account the complete probability distribution to generate an answer. For example, the Bayesian answer to the question "what is the best output for a given X?" under a quadratic loss function would be the average of the values at that point under the posterior of the input-output functions. This involves an integration over the probability space that cancels out the Jacobian determinant, so that integrating over the weight posterior is the same as integrating over the input-output function posterior. Thus, this type of answer, considered as the output of the learning algorithm, makes the complete Bayesian procedure functionally invariant as we have defined it.
3.2 Regularizers Implying a Target Mean Function

Input noise addition during training is equivalent to Tikhonov regularization when the number of patterns in the training set is infinite [1]. Even with finite training sets, it is equivalent to the addition of a penalization term [5, 6], although this is not a classical regularizer because it involves the output training patterns and depends on E(W). However, the functional invariance property of noise addition is better understood by taking a wider perspective. The minimization of a quadratic E(W) can be viewed as an attempt to estimate with the network the mean values taken by the output patterns for a given input. Usually, there are no or very few desired values for each point in the input space. However, with input noise addition, potentially infinite patterns are available, and the expected output pattern for a given input point is [3]:

m(X) = Σ_{s=1}^{np} Y_s p(X − X_s) / Σ_{k=1}^{np} p(X − X_k),

where {X_i, Y_i}, i = 1..np, are the original (not noisy) training patterns and p is the probability density function of the noise. We require p(X) to be non-zero for every X in the input space, so that the randomly generated patterns cover it completely. This assures that m(X) is always well defined, since otherwise the denominator could be zero somewhere. We need m(X) to be well defined over the entire input space, because the zones of the input space that have no desired values leave the network free to interpolate arbitrarily. Instead, if a target function is defined over the whole input space, any family of networks with enough approximation power has to approximate the same function, with no degrees of freedom left. Thus, the fundamental condition for noise addition to be a functionally invariant method is the use of infinite-domain probability density functions. An important remark is that the existence and uniqueness of m(X), independently of the architecture used, makes input noise addition functionally invariant in the more general sense given in Section 2, not limited to one-hidden-layer architectures.

3.3 Decomposable Regularizers

Let us now consider regularizers for one-hidden-layer networks that are additively decomposable, R(W) = Σ_u r(w_u), where w_u is the vector of all incoming and outgoing weights of hidden unit u. We also require that r(w_u) have a minimum value of zero attained when w_u = 0.¹ These regularizers exhibit functional invariance if there exists a threshold for r(w_u) such that, when the number of units tends to infinity, all the units under this threshold tend to the minimum of r(w_u), i.e., their associated weights tend to zero. This can be argued as follows: suppose we have a network with n hidden units that has been brought to a global minimum of C(W). In the limit of n = ∞, if we order the values of r(w_u), this sequence must tend to zero or, otherwise, R(W) = ∞ at the
¹ As a matter of fact, it would suffice that the weights from the input layer (excluding the bias unit) to u were zero at the minimum of r(w_u).
minimum. Thus, there exists a finite number N of units whose r(w_u) is above the threshold and which, therefore, are significantly nonlinear. The remaining n − N units contribute globally a purely linear mapping A to the input of the output layer. For a network with n − 1 hidden units, a configuration with the same N nonlinear units and all but one of the same n − N linear units reproduces the function implemented by the n HUs with a difference that tends to zero as n grows. With an appropriate scaling of the linear units, the cost C(W) would be the same because, on the one hand, since the n − N − 1 units are linear, the mapping A can be recovered perfectly (therefore keeping E(W)) and, on the other hand, since they are at the minimum of r(w_u), the infinitesimal scaling does not affect the regularization cost, because ∂r/∂w_u = 0. This is a global minimum of the (n − 1)-HU network. If there were a lower minimum with different nonlinear units or a different linear mapping, it would be easy to build a weight configuration in the n-HU network having the same E(W) and the same or lower R(W), which would contradict the hypothesis that the original minimum of C(W) for the n-HU network was global. Thus, functional invariance is guaranteed.
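Classical weight-decay is the simplest example of such a decomposable regularizer on a one-hidden-layer network: the global penalty Σ w² regroups exactly into per-unit terms r(w_u) over each unit's incoming and outgoing weights, and r(w_u) = 0 precisely when w_u = 0. A minimal check, with hypothetical layer sizes:

```python
import numpy as np

# Weight decay as an additively decomposable regularizer: R(W) = sum_u r(w_u),
# where w_u gathers the incoming and outgoing weights of hidden unit u.

def r_unit(w_in_u, w_out_u):
    """Per-unit penalty r(w_u); zero exactly when all weights of u vanish."""
    return np.sum(w_in_u ** 2) + np.sum(w_out_u ** 2)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W_in = rng.normal(size=(n_in, n_hidden))   # input -> hidden weights
w_out = rng.normal(size=n_hidden)          # hidden -> output weights

R_global = np.sum(W_in ** 2) + np.sum(w_out ** 2)
R_decomposed = sum(r_unit(W_in[:, u], w_out[u]) for u in range(n_hidden))
print(np.isclose(R_global, R_decomposed))  # True
```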
Fig. 2. Results of training a 4-HU network and an 8-HU network with 25 samples of the function sin(x₁ + x₂), using a weight-decay coefficient value of 0.4. The resulting configurations contain replicated units that fill all the spare units available.
The deterministic algorithm that emulates learning with random weights, which triggered this work, belongs to this category. It relies on the equivalence of weight noise addition with the addition of a decomposable regularizer, which for networks with linear activation functions at the output units takes the form [6, 7]:

r(w_u) = a y_u² + b Σ_m w_mu² y′_u²,
where a and b are constants, w_mu is the weight from the hidden unit u to the output unit m, and y_u and y′_u are the activation function of u and its derivative, respectively. It can be observed that this regularizer satisfies the condition of having its minimum at zero only when symmetric activation functions are used.

But the most surprising example of this type of regularizer is good old weight-decay. Careful experiments indicate that it satisfies the non-obvious condition mentioned in the first paragraph of this section; the theoretical demonstration is still work in progress. As an example, see Figure 2, where the weights resulting from training a 4-HU network and an 8-HU network with 25 samples of the function sin(x₁ + x₂) are shown. Linear activation functions were used at the output units. One can see that in Figure 2(a) the first and the fourth units are replicated. In Figure 2(b),
the same non-replicated units as in (a) appear, and the two units that were replicated in (a) appear six times in (b), but with smaller weight magnitudes. This is the scaling we were referring to above. When the number of hidden units is very large, the weight magnitudes of the replicated units become practically null, but together they keep forming the same linear mapping.

We approximate the functional distance experimentally as the mean square distance between the outputs of two networks on a grid of 10,000 points regularly distributed in the input domain. Figure 3 shows the functional distances between architectures with different numbers of hidden units, minimized for several weight-decay coefficient values α. Since for some α's some of the networks fell into local minima, in these cases we used the best minimum selected from three different random trials. It is evident that as α grows, all the architectures tend to produce the same results. But the most interesting observation from this graph is that, for any α > 0, the distances between architectures decrease very quickly as the number of hidden units grows, and are indistinguishable from zero above 50 units. Notice that the comparisons involve networks that differ the more in their number of hidden units, the larger the architectures are. This observation agrees with the expectation of a tendency to closer similarity for larger nets. Of course, all the architectures exhibit almost the same generalization error for any positive α, especially those above 50 HUs.
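The grid-based proxy used here is easy to reproduce. The two networks below are arbitrary stand-ins (a single tanh ridge and a copy shifted by the constant 0.01), not the weight-decay-trained networks of Figure 3.

```python
import numpy as np

# Empirical functional distance: mean squared output difference on a regular
# 100 x 100 grid (10,000 points) over a 2-dimensional input domain.

def grid_distance(f, g, lo=-1.0, hi=1.0, n=100):
    xs = np.linspace(lo, hi, n)
    X = np.array(np.meshgrid(xs, xs)).reshape(2, -1).T   # (n*n, 2) grid points
    return np.mean((f(X) - g(X)) ** 2)

f = lambda X: np.tanh(X @ np.array([1.0, -1.0]))
g = lambda X: np.tanh(X @ np.array([1.0, -1.0])) + 0.01  # constant offset

print(grid_distance(f, f))  # 0.0
print(grid_distance(f, g))  # 0.01**2 = 1e-4, up to float rounding
```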
Discussion

In this paper we have put forth a definition of functional invariance, which basically states that a learning method is functionally invariant if, when applied to increasingly large networks, the output for every possible input tends to a limit, and we have examined what kinds of methods possess this property. We have identified three mechanisms that can originate functional invariance:

- Bayesian treatment of neural networks with weight priors that converge to a prior over functions,
- implicit definition of a mean target for the complete input space, and
- additively decomposable regularizers that produce minima with a finite number of nonlinear units in the limit of infinite units.

Examples of the first two types of mechanisms are Bayesian learning with Neal's type weight priors, and input noise addition using probability density functions taking non-zero values over all their domain, respectively. Examples of the third type are the regularizer that emulates learning with random weights and classical weight-decay.

There are relations between these mechanisms, but it is difficult to see a unifying principle. For example, the second type can be viewed as defining a prior over input-output functions, as the first type does, namely one that always concentrates the probability on a single function. However, this prior is the same for all networks using the second type of mechanism, while in the first mechanism the prior over functions is approximated only for large networks. In addition, since the target function is completely defined in the second type of mechanism, the distance to that function is
directly minimized, whereas in the first type, averaging over the probability distributions is required to guarantee functional invariance. There are also differences in the type of units that these mechanisms generate. The third type produces only a finite number of nonlinear units in the infinite limit, while the second gives rise to an infinite number of them: since the implicit target function can be any function, an infinite number of nonlinear units is required to approximate it [2].

It could seem strange that no one (to our knowledge) had observed the functional invariance of, for example, weight-decay. However, careful optimizations are required to observe regular patterns in the weight configurations and thus see this property sharply. This does not mean that these results are not of practical relevance; as two different but large networks are brought moderately close to a global minimum, the functional distance between them becomes very low. The problem of falling into local minima can be significant, but their frequency using weight-decay is apparently not high. For example, in the experiment of Figure 3, comparing the functional distances of several nets with different weight-decay coefficients, among the numerous optimizations required we got a local minimum in a single trial only three times.
Fig. 3. Evaluation of the similarity between the functions implemented by architectures with different numbers of hidden units. Values spanning a wide range of the weight-decay coefficient α are tested. (The plot shows functional distances for the pairs 6–12, 12–24, 24–50, 50–100, 100–200 and 200–400 HUs as α ranges from 1 down to 0.)
References
1. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, 1995.
2. Hornik, K.: "Approximation capabilities of multilayer feedforward networks". Neural Networks, Vol. 4 (2), pp. 251–257, 1991.
3. Koistinen, P. and Holmström, L.: "Kernel regression and backpropagation training with noise". Advances in Neural Information Processing Systems 4, Morgan Kaufmann, 1992.
4. Neal, R.M.: Bayesian Learning for Neural Networks. Springer-Verlag, New York, 1996.
5. Ruiz de Angulo, V. and Torras, C.: "Random weights and regularization". ICANN'94, Sorrento, pp. 1456–1459, 1994.
6. Ruiz de Angulo, V. and Torras, C.: "A deterministic algorithm that emulates learning with random weights". Neurocomputing (to appear).
7. Ruiz de Angulo, V. and Torras, C.: "Architecture-independent approximation of functions". Neural Computation, Vol. 13, No. 5, pp. 1119–1135, May 2001.
8. Wolpert, D.H.: "Bayesian backpropagation over I-O functions rather than weights". Advances in Neural Information Processing Systems 6, Morgan Kaufmann, 1994.
Boosting Mixture Models for Semi-supervised Learning

Yves Grandvalet¹, Florence d'Alché-Buc², and Christophe Ambroise¹

¹ Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne cedex, France
{Yves.Grandvalet,Christophe.Ambroise}@hds.utc.fr
² LIP6, UMR CNRS 7606, Université Pierre et Marie Curie, 4, place Jussieu, 75252 Paris Cedex, France
[email protected]

Abstract. This paper introduces MixtBoost, a variant of AdaBoost dedicated to solving problems in which both labeled and unlabeled data are available. We propose several definitions of loss for unlabeled data, from which margins are defined. The resulting boosting schemes implement mixture models as base classifiers. Preliminary experiments are analyzed and the relevance of loss choices is discussed. MixtBoost improves on both mixture models and AdaBoost provided classes are structured, and is otherwise similar to AdaBoost.
1 Introduction
In many discrimination problems, unlabeled examples are massively available while labeled ones are scarce. Such situations arise in applications where class labeling cannot be automated and requires an expert. The domains concerned range from ecology to industrial process diagnosis, but the two archetypal ones are web and text mining. Nowadays the latter receives much attention, and recent papers show the considerable improvements provided by taking into account a large pool of unclassified documents [8]. This paper investigates whether boosting can be adapted to semi-supervised learning. Boosting is so successful in supervised classification that Breiman, quoted by Friedman et al. [5], called AdaBoost with trees the "best off-the-shelf classifier in the world". Basically, boosting concentrates on difficult examples, also qualified as near-miss examples. It is not obvious that it can be adapted to unclassified examples, for which misclassification cannot be assessed, and still less that it can take advantage of them. From the brief review of AdaBoost given in section 2, ways of extending boosting to semi-supervised problems are discussed in section 3. Section 4 then motivates the use of mixture models as base classifiers and details an original implementation. The resulting algorithm for boosting mixture models, MixtBoost, is tested in preliminary experiments in section 5, while section 6 summarizes the main conclusions and presents perspectives of this work.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 41–48, 2001.
© Springer-Verlag Berlin Heidelberg 2001
2 AdaBoost
AdaBoost is an acronym standing for adaptive boosting. Boosting refers to the problem of improving the performance of any rough "base classifier" by combining. The adaptivity of AdaBoost refers to its capacity to boost without prior knowledge of the accuracy of the base classifiers. There are several versions of boosting, and among them several forms of AdaBoost. For the sake of clarity, we only consider here the special case of binary classification, but multiclass and regression tasks can also be tackled by this algorithm [3,4,5]. AdaBoost is an iterative learning method which produces a combination of simple classifiers. Provided the latter have error rates slightly better than random guessing, the final combination is ensured to be arbitrarily accurate. The final aggregated classifier is defined as sign(g_T), with

    g_T(x) = Σ_{t=1}^{T} α_t h_t(x) ,    (1)

where α_t denotes the (positive) weight of base classifier h_t in the combination. The procedure for generating h_t and α_t from a learning set {x_i, y_i}_{i=1}^n is summarized in Fig. 1. It only requires defining either a loss function l or a consistent margin ρ. Throughout this paper, we use the original proposition of Freund and Schapire [4], with h(x) ∈ [−1, 1] and y ∈ {−1, 1}. The loss is defined as

    l(h(x), y) = (1/2) |h(x) − y| ,    (2)

with the corresponding margin

    ρ(h(x), y) = 1 − 2 l(h(x), y) .    (3)

The latter relationship between loss and margin provides the usual definition of margin, ρ(h(x), y) = h(x) y.

1. Start with weights w_1(i) = 1/n, i = 1, ..., n, and g_0 = 0.
2. Repeat for t = 1, ..., T:
   (a) Fit the base classifier h_t(x) ∈ [−1, 1], providing it with the weighted empirical distribution defined by w_t = (w_t(1), ..., w_t(n)).
   (b) Compute the weighted error ε_t = Σ_{i=1}^{n} w_t(i) l(h_t(x_i), y_i).
   (c) Compute α_t = log(1 − ε_t) − log(ε_t), and g_t = g_{t−1} + α_t h_t.
   (d) Set the weights w_{t+1}(i) = exp(−ρ(g_t(x_i), y_i)) and normalize so that Σ_{i=1}^{n} w_{t+1}(i) = 1.
3. Output the final classifier sign(g_T(x)).

Fig. 1. AdaBoost algorithm
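The steps of Fig. 1 can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code; `fit_base` is a hypothetical user-supplied routine that fits a base classifier on the weighted empirical distribution and returns a function with outputs in [−1, 1].

```python
import numpy as np

def adaboost(X, y, fit_base, T=100, eps_clip=1e-10):
    """Sketch of the algorithm in Fig. 1 for binary labels y in {-1, +1}.

    fit_base(X, y, w) is a hypothetical user-supplied routine returning a
    base classifier h with h(X) in [-1, 1].
    """
    n = len(y)
    w = np.full(n, 1.0 / n)              # step 1: w_1(i) = 1/n
    g = np.zeros(n)                      # g_t evaluated at the training points
    hypotheses, alphas = [], []
    for _ in range(T):
        h = fit_base(X, y, w)            # step 2(a)
        hx = h(X)
        loss = 0.5 * np.abs(hx - y)      # loss (2)
        eps = np.clip(w @ loss, eps_clip, 1.0 - eps_clip)  # step 2(b)
        alpha = np.log(1.0 - eps) - np.log(eps)            # step 2(c)
        hypotheses.append(h)
        alphas.append(alpha)
        g = g + alpha * hx
        w = np.exp(-g * y)               # step 2(d): margin rho(g, y) = g y
        w = w / w.sum()
    # step 3: final classifier sign(g_T(x))
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hypotheses)))
```

The clipping of ε_t is a numerical safeguard for base classifiers that reach zero weighted error; it is not part of the algorithm of Fig. 1.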
Many empirical studies have shown that AdaBoost is an efficient scheme for reducing the error rates of classifiers such as trees or neural networks. In the experiments of Schapire et al. [10], boosting almost systematically improves upon a single classifier. This success is explained by the beneficial effect of boosting on margin distributions, the latter being related to bounds on the prediction error. The relevance of this connection is challenged by Breiman [3], who furthermore indicates that boosting may be viewed as an iterative optimization technique minimizing some decreasing function of the margin. From this point of view, the number of boosting steps T in (1) is similar to the number of iterations in other optimization algorithms. Overfitting can be avoided by an early stopping procedure (such as in [10]), by regularizing to get "soft margins" [9], or by replacing exp(−ρ) at step 2(d) of Fig. 1 by other margin functionals such as 1 − tanh(ρ) [6].
3 Generalizing AdaBoost to Semi-supervised Data
Once AdaBoost is provided with a base classifier, its main components are the loss l and the margin ρ (at steps 2(b) and 2(d) in Fig. 1). The simplest way to generalize to semi-supervised problems thus consists in defining a loss/margin for unlabeled data. The generalization to unlabeled data should not affect the loss for labeled examples, and should penalize inconsistencies between the classifier output h(x) and the available information. We first assume that a true label exists for each example. Furthermore, missing labels are not due to censoring, but simply reflect the absence of class information. In particular, this excludes the case where labels are missing depending on the true label. In probabilistic terms, they are supposed to be missing at random. Missing labels are thus interpreted as "the example belongs to one class, but this class is unknown". On the one hand, as a missing label indicates that any label is possible, no output is inconsistent with this information. Our first "basic" generalization, denoted l_0, consists in giving a null loss to unlabeled examples. On the other hand, even if the label is missing, the truth is that only one label is correct. The loss should thus penalize inconsistencies with plausible labels. A first candidate solution is the loss associated with the most probable label

    l_P(h(x), m) = l(h(x), y_max) ,  with  y_max = Argmax_y P(Y = y | X = x) ,    (4)

where m denotes a missing label. Our second proposal is the expected loss with respect to possible labels

    l_E(h(x), m) = E( l(h(x), Y) ) .    (5)
Posterior probabilities P(Y = y | X = x) being unknown, some estimates have to be used. They can be derived from the classifier h or from the combined classifier g.¹ Considering that (h(x) + 1)/2 or (g(x)/‖α‖₁ + 1)/2 estimates the posterior probability P(Y = 1 | X = x), two approximations of loss (4) (respectively denoted l_h and l_g) and of loss (5) (respectively denoted l'_h and l'_g) are defined. With these losses, it should be kept in mind that the posterior probabilities are estimated. Paradoxical situations may arise when the latter are not provided by the classifier at hand, h. In particular, an unlabeled example may be reported to be badly classified while its true label is unknown. This situation is prohibited by restricting the loss function to lie in [0, 1/2] for l_g and l'_g. Table 1 summarizes the losses and margins for missing labels which are considered throughout our experiments at step 2(b) of Fig. 1. Margins for h are obtained from losses by the linear transformation (3). Margins for g, required at step 2(d), are computed by replacing h by the normalized classifier g/‖α‖₁ in the second column, and unnormalizing (ρ(g(x), m) = ‖α‖₁ ρ(g(x)/‖α‖₁, m)).

Table 1. Losses and margins for a missing label m

      l(h(x), m)                          ρ(h(x), m)                 ρ(g(x), m)
l_0   0                                   1                          ‖α‖₁
l_h   (1/2) |h(x) − sign(h(x))|           |h(x)|                     |g(x)|
l'_h  (1/2) (1 − h(x)²)                   h(x)²                      g(x)²/‖α‖₁
l_g   (1/2) min(|h(x) − sign(g(x))|, 1)   max(h(x) sign(g(x)), 0)    |g(x)|
l'_g  (1/2) min(1 − h(x) g(x)/‖α‖₁, 1)    max(h(x) g(x)/‖α‖₁, 0)     g(x)²/‖α‖₁
All loss options amount to assuming that any classifier output is to some extent correct for unlabeled examples. Their error measure is systematically lower than 1/2, and their corresponding margin is positive. For l_0, all outputs are equally correct; for l_h and l'_h, self-confident outputs are privileged; and for l_g and l'_g, outputs consistent with the aggregated classifier are preferred.
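For concreteness, the losses and margins of Table 1 can be sketched as below. This is an illustrative transcription, with `alpha_norm` standing for ‖α‖₁ and the primed losses written as the strings `"lh'"` and `"lg'"`:

```python
import numpy as np

def unlabeled_loss(name, h_x, g_x, alpha_norm):
    """Sketch of the five losses of Table 1 for a missing label.

    h_x: base classifier output in [-1, 1]; g_x: aggregated output;
    alpha_norm: the norm ||alpha||_1 of the combination weights.
    """
    if name == "l0":       # any label is possible: null loss
        return 0.0
    if name == "lh":       # most probable label according to h itself
        return 0.5 * abs(h_x - np.sign(h_x))
    if name == "lh'":      # expected loss under h's posterior estimate
        return 0.5 * (1.0 - h_x ** 2)
    if name == "lg":       # most probable label according to g
        return 0.5 * min(abs(h_x - np.sign(g_x)), 1.0)
    if name == "lg'":      # expected loss under g's posterior estimate
        return 0.5 * min(1.0 - h_x * g_x / alpha_norm, 1.0)
    raise ValueError(name)

def unlabeled_margin(name, h_x, g_x, alpha_norm):
    # margin obtained from the loss via the linear transformation (3)
    return 1.0 - 2.0 * unlabeled_loss(name, h_x, g_x, alpha_norm)
```

One can check, for instance, that the margin of `"lh"` reduces to |h(x)| and that of `"lh'"` to h(x)², as in the second column of Table 1.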
4 Base Classifier
The base classifier should be able to make use of the unlabeled data provided by the boosting algorithm. Mixture models are well suited for this purpose, as shown by their extensive use in clustering for representing different subpopulations. Hierarchical mixtures provide flexible discrimination tools, where each conditional distribution f(x | y = k) is modelled by a mixture of components [2]. At the high level, the distribution is described by

    f(x | Φ) = Σ_{k=1}^{K} p_k f_k(x; θ_k) ,    (6)

¹ The outputs of the normalized combined classifier g(x)/‖α‖₁, where ‖α‖₁ = Σ_{t=1}^{T} α_t is the norm of α, span [−1, 1].
Boosting Mixture Models for Semisupervised Learning
45
where K is the number of classes, p_k are the mixing proportions, θ_k the conditional distribution parameters, and Φ denotes all parameters {p_k; θ_k}_{k=1}^{K}. The high-level description can also be expressed as a low-level mixture of components, as shown here for binary classification:

    f(x | Φ) = Σ_{k₁=1}^{K₁} p_{k₁} f_{k₁}(x; θ_{k₁}) + Σ_{k₂=1}^{K₂} p_{k₂} f_{k₂}(x; θ_{k₂}) .    (7)
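Under the low-level mixture (7), the base classifier output h(x) = 2 P(Y = 1 | X = x) − 1 follows from Bayes' rule, with P(Y = 1 | x) given by the share of the first K₁ components in the total density. A minimal sketch (the component densities are assumed precomputed in an array `F`; this is an illustration, not the authors' code):

```python
import numpy as np

def mixture_output(F, p, K1):
    """h(x) = 2 P(Y = 1 | X = x) - 1 under the low-level mixture (7).

    F: (n, K1+K2) array of component densities f_k(x_i; theta_k), assumed
    precomputed; p: mixing proportions; the first K1 components model y = 1.
    """
    joint = F * np.asarray(p)                      # p_k f_k(x_i; theta_k)
    post1 = joint[:, :K1].sum(axis=1) / joint.sum(axis=1)
    return 2.0 * post1 - 1.0
```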
In this setting, the EM algorithm can be used to maximize the log-likelihood with respect to Φ, considering that the incomplete data is {x_i, y_i}_{i=1}^n and the missing data is the component label c_ik, k = 1, ..., K₁+K₂ [7]. We consider an original implementation of EM based on the concept of possible labels [1]. It is well adapted to hierarchical mixtures, where the class label y provides a subset of possible components. When y = 1 the first K₁ modes are possible, when y = −1 the last K₂ modes are possible, and when an example is unlabeled, all modes are possible. A binary vector z_i ∈ {0, 1}^{K₁+K₂} indicates the components from which feature vector x_i may have been generated, in agreement with the assumed mixture model and the (absence of) label y_i. Assuming that the training sample {x_i, z_i}_{i=1}^n is i.i.d., the weighted log-likelihood is given by

    L(Φ; {x_i, z_i}_{i=1}^n) = Σ_{i=1}^{n} w_t(i) log f(x_i, z_i; Φ) ,    (8)
where w_t(i) are the weights provided by boosting at step t. L is maximized using the following EM algorithm:

E-Step. Computation of the expectation of L(Φ; {x_i, z_i}_{i=1}^n) conditionally on {x_i, z_i}_{i=1}^n and the current value of Φ (denoted Φ^q):

    Q(Φ | Φ^q) = Σ_{i=1}^{n} Σ_{k=1}^{K₁+K₂} w_t(i) u_ik log( p_k f_k(x_i; θ_k) ) ,    (9)

with u_ik = z_ik p_k f_k(x_i; θ_k) / Σ_ℓ z_iℓ p_ℓ f_ℓ(x_i; θ_ℓ).

M-Step. Maximization of Q(Φ | Φ^q) with respect to Φ. Assuming that each mode k follows a Gaussian distribution with mean μ_k and covariance Σ_k, Φ^{q+1} = {p_k^{q+1}; μ_k^{q+1}; Σ_k^{q+1}}_{k=1}^{K₁+K₂} is given by:

    p_k^{q+1} = Σ_i w_t(i) u_ik / Σ_i w_t(i) ;    μ_k^{q+1} = Σ_i w_t(i) u_ik x_i / Σ_i w_t(i) u_ik ;    (10)

    Σ_k^{q+1} = (1 / Σ_i w_t(i) u_ik) Σ_{i=1}^{n} w_t(i) u_ik (x_i − μ_k^{q+1})(x_i − μ_k^{q+1})ᵀ .    (11)

The parameters of the hierarchical model are thus estimated altogether.
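One iteration of this weighted EM can be sketched as below. This is a NumPy illustration under the stated Gaussian assumption, not the authors' implementation; `Z` is the binary possible-component mask z_i and `w` holds the boosting weights w_t(i).

```python
import numpy as np

def em_step(X, Z, w, p, mu, Sigma):
    """One EM iteration: E-step (9), M-step (10)-(11); an illustrative sketch.

    X: (n, d) data; Z: (n, K) binary possible-component mask z_i;
    w: (n,) boosting weights; p, mu, Sigma: current mixture parameters.
    """
    n, d = X.shape
    K = len(p)
    # component densities f_k(x_i; theta_k) under the Gaussian assumption
    F = np.empty((n, K))
    for k in range(K):
        diff = X - mu[k]
        inv = np.linalg.inv(Sigma[k])
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[k]))
        F[:, k] = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm
    # E-step: responsibilities u_ik restricted to the possible components
    U = Z * p * F
    U /= U.sum(axis=1, keepdims=True)
    # M-step: weighted updates (10)-(11)
    wu = w[:, None] * U                                 # w_t(i) u_ik
    p_new = wu.sum(axis=0) / w.sum()
    mu_new = (wu.T @ X) / wu.sum(axis=0)[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (wu[:, k, None] * diff).T @ diff / wu[:, k].sum()
    return p_new, mu_new, Sigma_new
```

Setting a row of `Z` to all ones reproduces the unlabeled case where every mode is possible; a labeled example keeps only its class's K₁ (or K₂) modes.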
5 Preliminary Experiments

5.1 Experimental Setup
Preliminary tests of the algorithm are performed on three benchmarks of the boosting literature: banana [9], twonorm and ringnorm [3]. Banana is a two-dimensional two-class benchmark with 400 training patterns and 4900 test patterns. The class shapes are complex and the best reported test error rate is 10.7% [9]. Twonorm and ringnorm are 20-dimensional two-class benchmarks with 300 training patterns and 4000 test patterns. Classes are Gaussian clusters, Bayes error rates are low (respectively 1.5% and 2.3%), and Rätsch et al. [9] report error rates of 1.6% and 2.7%. For all benchmarks, training is performed on 10 different samples and in five different settings where 0%, 50%, 75%, 90% and 95% of labels are missing. MixtBoost is tested for all margins defined in section 3, and is compared to mixture models and AdaBoost. MixtBoost and AdaBoost are trained identically, the only difference being that AdaBoost is not fed with missing labels. Both algorithms are run for T = 100 boosting steps, without protection against overfitting. The base classifier is a hierarchical mixture model with an arbitrary choice of 4 modes per class. Mixture models provided for comparison use the same hyperparameters, but the algorithm (which may get stuck in local minima) is restarted 100 times from different initial solutions, and the best final solution (regarding training error rate) is selected.

5.2 Results
We report mean error rates together with the lower and upper quartiles in Table 2. Compared to EM, AdaBoost usually provides good results for low rates of missing labels, but it is always outperformed at the highest rate. The mediocre performance of EM is due to convergence problems and to the criterion used to select a solution, which may be inadequate when many examples are unlabeled. Without missing labels, MixtBoost is equivalent to AdaBoost and thus provides identical error rates. For all settings, the loss l_0 yields equal or worse performance than AdaBoost. Providing the highest margin to unlabeled data results in an exponential forgetting of all unlabeled examples. It is thus puzzling to observe occasional big differences between l_0 and AdaBoost on twonorm and ringnorm. In fact, learning curves (not provided here) show that l_0 usually performs quite as well as AdaBoost in the very first boosting steps, but then its performance degrades heavily while AdaBoost stops rapidly. For twonorm and ringnorm, the loss l_h never significantly beats AdaBoost, and often provides extremely bad solutions, where nearly the whole space is assigned to a single class, resulting in 50% error rates. The algorithm is extremely unstable; the combination coefficients diverge: at the base classifier level, unlabeled examples are classified with hard margins to opposite classes throughout the boosting process. The smoother version l'_h provides seemingly better results, but the instability remains, and worse results should be expected when increasing the number of boosting steps. For banana, the good performance of l_h (and
Table 2. Error rates (in %) obtained with 5 percentages of missing labels for mixture models (EM), AdaBoost (AB) and MixtBoost with the five margins defined in section 3. The first figure is the mean error rate, and the interval displays the interquartile range.

Banana   0%               50%              75%              90%              95%
EM       15.2[14.4,15.3]  18.2[16.7,18.6]  21.8[18.0,25.0]  26.1[20.7,29.8]  31.7[23.8,35.8]
AB       11.7[11.3,12.0]  12.6[11.7,13.1]  15.2[13.0,16.8]  22.1[18.0,24.3]  37.5[32.2,42.2]
l_0      11.7[11.3,12.0]  12.4[11.7,12.8]  15.3[13.0,16.9]  22.0[18.2,24.0]  35.2[23.9,40.4]
l_h      11.7[11.3,12.0]  12.8[12.2,13.3]  16.5[14.6,17.8]  23.9[20.1,24.7]  30.7[25.7,34.3]
l'_h     11.7[11.3,12.0]  12.3[11.7,12.9]  14.7[12.9,15.2]  17.5[15.3,19.3]  25.6[21.3,25.7]
l_g      11.7[11.3,12.0]  12.5[11.8,13.1]  15.0[13.3,15.4]  18.4[15.9,19.7]  32.5[22.1,35.6]
l'_g     11.7[11.3,12.0]  12.7[12.1,13.1]  15.1[13.6,16.4]  19.7[15.6,21.9]  35.3[26.8,40.7]

Twonorm  0%             50%              75%              90%              95%
EM       2.7[2.5,2.9]   3.2[2.7,3.1]     6.5[3.0,9.0]     20.6[10.3,22.5]  24.8[18.3,31.9]
AB       3.2[2.9,3.3]   3.2[2.9,3.2]     3.2[3.0,3.5]     11.0[5.2,14.2]   38.9[29.4,50.0]
l_0      3.2[2.9,3.3]   3.1[2.7,3.2]     2.9[2.7,3.0]     28.0[6.0,38.5]   38.5[24.0,48.4]
l_h      3.2[2.9,3.3]   50.0[50.0,50.0]  50.0[50.0,50.0]  49.6[49.7,50.0]  47.4[49.7,50.1]
l'_h     3.2[2.9,3.3]   3.9[3.2,3.8]     13.2[6.0,16.4]   16.1[8.3,20.6]   17.3[9.7,22.5]
l_g      3.2[2.9,3.3]   2.9[2.6,3.0]     3.0[2.8,3.2]     4.2[3.4,4.9]     6.2[3.6,7.7]
l'_g     3.2[2.9,3.3]   2.8[2.6,3.0]     3.1[2.9,3.2]     3.9[2.8,3.5]     7.2[3.4,8.4]

Ringnorm 0%             50%              75%              90%              95%
EM       1.9[1.7,2.0]   2.1[1.7,2.1]     4.3[1.9,5.7]     9.5[2.7,12.0]    23.7[14.5,27.0]
AB       1.8[1.6,1.9]   1.8[1.6,2.0]     3.1[1.9,4.1]     11.5[4.2,12.1]   28.7[11.5,37.6]
l_0      1.8[1.6,1.9]   1.7[1.6,1.7]     2.9[1.9,2.5]     24.0[2.2,44.4]   45.8[42.9,49.7]
l_h      1.8[1.6,1.9]   50.0[50.0,50.0]  49.9[49.9,50.0]  49.9[49.8,50.0]  50.0[50.0,50.0]
l'_h     1.8[1.6,1.9]   2.0[1.7,2.3]     8.2[3.1,10.3]    13.2[3.3,17.9]   29.9[11.9,46.8]
l_g      1.8[1.6,1.9]   1.7[1.6,1.7]     2.0[1.8,2.2]     3.7[2.0,3.8]     5.8[2.3,6.3]
l'_g     1.8[1.6,1.9]   2.0[1.7,1.9]     2.2[2.0,2.5]     3.7[1.9,4.0]     4.5[2.9,4.9]
to a lesser extent l'_h) is also an artifact due to early stopping, as instability can also be observed on the combination coefficients. Regarding the versions of MixtBoost based on the estimation of posterior probabilities by g, the results are qualitatively similar for twonorm and ringnorm, with very high improvements on both mixture models and AdaBoost for high rates of missing labels. These versions benefit from the base classifiers' responses being consistent with the aggregated classifier g. On the other hand, one would expect l'_g to be consistently more accurate than the hard version l_g, but there is no sign of such (nor of the opposite) systematic behavior. For banana, MixtBoost is not very efficient: its results are only slightly superior to AdaBoost's for high rates of missing labels. We conjecture that the complex class structure prevents the base classifier from extracting much information from unlabeled examples.
6 Conclusion
MixtBoost, a new boosting scheme for semi-supervised learning, was proposed. It generalizes AdaBoost by consistently extending the definitions of loss and margin to unlabeled data. Our experimental results show that boosting can improve the base classifier's performance, especially when only a few labeled examples are provided. MixtBoost is thus attractive in applications such as web and text mining. Large-scale experiments have yet to be performed, but in other experiments not reported here for lack of space, high improvements are still obtained for high Bayes error rates. MixtBoost can thus deal with difficult classification problems, provided data are structured into clusters, i.e. when unlabeled data convey relevant information for the classification task. A semi-supervised task involves examples for which the label is either precisely known or completely unknown. When labeling is performed by an expert, the true state of knowledge is better represented by membership of class subsets. For example, in medical diagnosis, a physician is sometimes able to discard some diseases, but not to pinpoint the precise illness of the patient. As mixture models can handle partial class information, MixtBoost should be easily extended to these partially supervised problems.
References

1. C. Ambroise and G. Govaert. EM algorithm for partially known labels. In IFCS 2000, July 2000.
2. C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualisation. IEEE PAMI, 20:281–293, 1998.
3. L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Statistics Department, University of California at Berkeley, 1997.
4. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
5. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
6. L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 2000.
7. G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, 1997.
8. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Using EM to classify text from labeled and unlabeled documents. Machine Learning, to appear.
9. G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
10. R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
Bagging Can Stabilize without Reducing Variance

Yves Grandvalet

Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne cedex, France
[email protected]

Abstract. Bagging is a procedure averaging estimators trained on bootstrap samples. Numerous experiments have shown that bagged estimates almost consistently yield better results than the original predictor. It is thus important to understand the reasons for this success, and also for the occasional failures. Several arguments have been given to explain the effectiveness of bagging, among which the original "bagging reduces variance by averaging" is widely accepted. This paper provides experimental evidence supporting another explanation, based on the stabilization provided by spreading the influence of examples. With this viewpoint, bagging is interpreted as a case-weight perturbation technique, and its behavior can be explained when other arguments fail.
1 Introduction
Bagging, introduced by Breiman in 1994, is a procedure building an estimator by a resample-and-combine technique [1]. From an original estimator, its bagged version is produced by averaging several replicates trained on bootstrap samples. Regarding prediction error, bagging seems to be an almost universal procedure for improving upon a single predictor. Hence, it is very important to understand the reasons for its successes, and also for its occasional failures. In many studies, the method almost systematically compares favorably with the original predictor, on artificial as well as on real data [1,5,10,11]. Other ensemble methods such as boosting and arcing are often more effective in reducing test errors, but in situations with substantial noise, bagging performs better [5,10]. Available explanations for bagging's success are briefly reviewed in section 2, where the present explanation is also summarized. We then provide new experimental evidence supporting the claim that bagging stabilizes prediction, in the sense that it equalizes the influence of each example in building the predictor. Two illustrative examples are given, in regression and in classification, in sections 3 and 4 respectively. Our findings are finally discussed in section 5.
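The resample-and-combine procedure can be sketched in a few lines. This is an illustration, not the paper's code; `fit` is a hypothetical routine returning a predictor, and the bagged predictor averages the outputs of the bootstrap replicates:

```python
import numpy as np

def bag(fit, X, y, B=50, rng=None):
    """Resample-and-combine sketch of bagging [1]: train `fit` on B
    bootstrap samples of the training set and average the predictors."""
    rng = np.random.default_rng(rng)
    n = len(y)
    predictors = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample, with replacement
        predictors.append(fit(X[idx], y[idx]))
    # bagged predictor: average of the bootstrap replicates
    return lambda Z: np.mean([f(Z) for f in predictors], axis=0)
```

For classification, the averaged output would be thresholded (a majority vote); the averaging form above matches the regression setting of section 3.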
2 How Does Bagging Work?
Breiman [1] presents bagging as a variance reduction procedure mimicking averaging over several training sets. The approximation taking place should be

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 49–56, 2001.
© Springer-Verlag Berlin Heidelberg 2001
kept in mind: averaging is performed on bootstrap replicates of a single training set, and not on different training sets. Thus, although experimental results have often shown the expected variance reduction [1,11], several other arguments have been given to explain the success of bagging. Schapire et al. [11] provide bounds for voting algorithms, including bagging, relating the generalization performance of aggregated classifiers to the margin distribution of examples. Unlike boosting, bagging does not explicitly maximize margins, but experiments show that for complex classifiers, bagging produces rather large margins [11]. However, the obtained bounds are acknowledged to be loose, and even qualitatively misleading according to Breiman [2]. Friedman and Hall [7] provide an asymptotic analysis which concludes that, in the limit of infinite samples, bagging reduces a nonlinear component of the predictor variance, while the linear part is unaffected. Bühlmann and Yu [3] present another theoretical study, focused on neighborhoods of discontinuities of decision surfaces. From these two studies, we retain that bagging asymptotically performs some smoothing on the estimate. Smoothing also occurs for finite samples, but it may not be the major effect of bagging. In particular, asymptotic analyses are not suited to the treatment of outliers, which are particularly well handled by bagging [5,10]. In a previous study, the author [8] showed how bagging affects the potential influence of training examples on the predictor. The limits of the "reduce variance by averaging" argument were already illustrated in point estimation, on a very simple mean estimation problem [8]. On the one hand, bagging was shown to increase or decrease the estimator variance according to the experimental setup. On the other hand, each example of the training sample was shown to be reweighted, and this reweighting systematically equalized the influence of training examples.

This paper presents two "classical" examples, one in regression and one in classification, where the variance argument also breaks down, while the explanation based on stabilization still holds.
3 Regression
The first example showing the limits of the variance argument was given by Breiman himself, in a linear regression experiment [1]. The original aim was to illustrate the benefits of bagging for unstable estimation procedures such as subset selection. A surprising by-product is that bagging is harmful for ordinary least squares (OLS). Breiman explains this failure by stating that, for OLS, averaging predictors trained on several independent datasets is better approximated by averaging over a single dataset drawn from the data distribution (original predictor) than by averaging over several bootstrap samples (bagged predictor). This statement implies that bagging fails when the variance argument reaches its limits. Here, we illustrate that the stabilization effect of bagging explains both its successes and its failures. The experimental setup (see [1] for more details) consists in replicating 250 times: 1) draw samples of size 60 from the model y = βᵀx + ε, where ε is drawn
from a normal distribution N(0, 1), x ∈ R³⁰, and β has 27 nonzero coefficients (results are qualitatively equivalent for the two other setups described in [1]); 2) compute estimates of β by forward subset selection; 3) generate 50 bootstrap samples to compute bagged estimates. The difference in quadratic prediction error is highly in favor of bagging for small subset sizes. With a single variable, the average error over the 250 experiments is 3.26 without bagging and 2.69 with bagging. This advantage decreases as the subset size increases, and there is a crossover point past which bagging is detrimental. For OLS, the average error is 2.03 without bagging and 2.11 with bagging. First, we stress that, in the present setting, the failure of bagged OLS can be proved to be due to an increase of variance, since both unbagged and bagged estimates are unbiased. For subset selection, there is no analytic expression for bias and variance. Their plug-in estimates show however that bias is not affected by bagging, while variance is reduced. In this experiment, the effect of bagging regarding variance is irregular. A regular effect on the potential influence of examples, which turns out to have different consequences in terms of variance, is exhibited below. In OLS, potentially highly influential points are known as leverage points, and the smoothing or hat matrix provides the statistics commonly used to flag them. Let T = {(x_i, y_i)}_{i=1}^n be the training set. The n-dimensional vector of fitted values at {x_i}_{i=1}^n (f̂_i = f̂(x_i)) can be written as f̂ = Sy, where y is the n-dimensional vector of response variables and S is the n × n smoothing matrix [9]. The ith row of S represents the sequence of weights (or equivalent kernel) used to compute f̂(x_i). Each element of S is thus relevant for detecting leverage, but a good summary is provided by the diagonal elements S_ii, since ∂f̂(x_i)/∂y_i = S_ii.

As trace(S) is a possible definition of the degrees of freedom of the smooth [9], S_ii can also be interpreted as the degrees of freedom spent to fit (x_i, y_i). Leverage points are thus associated with large S_ii. These statistics can be computed for bagged OLS [8], but not for subset selection, which is not a linear estimate, as y intervenes in the subset choice. For the latter, generalized statistics S̃_ii, based on a data perturbation approach similar to [4], are computed. Fig. 1 compares the histograms of S_ii and S_ii^bag over all experiments. The distribution for OLS (left) is unimodal, centered on 0.5 (30 free parameters are set by 60 points). Bagging provides a similar distribution, also centered on 0.5, meaning that complexity is not modified. The spread is however halved, rendering the equalization of influence among examples. The distribution is more complex for subset selection (right). A log scale is used to highlight that it is bimodal (on a linear scale, the high spread of the minor mode makes it hardly visible). The main mode contains about 90% of points with small S̃_ii values (mean about 1/60). It gathers points with little influence on the choice of the element entering the subset. The other mode contains highly influential points, with a mean S̃_ii value of about 0.6. These points play a leading part in choosing the variable entering the subset. The mean number of degrees of freedom is about 4.6 with a single variable: this
Fig. 1. Left: histograms of S_ii (top, light grey) and S_ii^bag (bottom, dark grey) for OLS (all variables); right: histograms of log₁₀(S̃_ii) (top, light grey) and log₁₀(S̃_ii^bag) (bottom, dark grey) for subset selection (one variable)
inflation is due to the subset choice, as discussed in [12]. The distribution for the bagged subset selection estimate is unimodal, centered on 0.1, with an average 5.9 degrees of freedom. Influence equalization is summarized by the halving in spread of S̃_ii^bag. The stabilization effect of bagging being displayed, its relationship with prediction error is investigated. The expected difference in prediction error between the OLS and the bagged estimate is a quadratic function in the smoothing matrices S and S^bag. The difference in normalized spread of leverage statistics,

    Spread(f̂) = ( (1/n) Σ_{i=1}^{n} S_ii² ) / ( (1/n) Σ_{i=1}^{n} S_ii )² − 1 ,    (1)

is used as a (far from exhaustive) summary statistic to depict the trend in the relationship between leverage statistics and prediction error. Fig. 2 displays the difference in prediction error between bagged and original estimates according to this statistic. Each point represents one of the 250 experiments. For OLS (left), differences in prediction error and differences in leverage statistics are positively correlated, while the opposite trend is observed for subset selection (right). For both regression schemes, performances of bagged predictors are thus shown to be related to the equalization of leverage statistics. The latter is performed blindly, without any assessment of the relevance of such a stabilization. In the considered experimental setting, stabilization happens to be detrimental for OLS (due to the absence of outliers) and beneficial for subset selection. For the latter, using more variables in the bagged estimate is not essential to bagging improvements, as bias is not affected in the process. The crux is that more points intervene in the subset choice, so that variance is reduced.
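For OLS, the leverage statistics S_ii and the normalized spread (1) are straightforward to compute, since the hat matrix is S = X(XᵀX)⁻¹Xᵀ (the hat-matrix formula is standard OLS theory, not specific to this paper):

```python
import numpy as np

def leverage_and_spread(X):
    """Diagonal S_ii of the OLS hat matrix S = X (X'X)^{-1} X', and the
    normalized spread statistic (1); an illustrative sketch."""
    S = X @ np.linalg.solve(X.T @ X, X.T)   # hat (smoothing) matrix
    s = np.diag(S)                          # leverage statistics S_ii
    # Spread(f) = mean(S_ii^2) / mean(S_ii)^2 - 1, as in (1)
    spread = np.mean(s ** 2) / np.mean(s) ** 2 - 1.0
    return s, spread
```

Two sanity checks follow from the text: the S_ii sum to the number of free parameters (trace(S) = degrees of freedom), and the spread is zero exactly when all examples have equal leverage.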
4 Classification
Failures of bagging in classification are rare. Schapire et al. [11] provide, however, an example on the ringnorm benchmark. They report that stumps (single-node trees) achieve a test error rate of 40.6%, compared to 41.4% for bagged stumps.
Bagging Can Stabilize without Reducing Variance
53
Fig. 2. Left: expected difference in prediction error E(PE^bag − PE) vs. (Spread(f^bag) − Spread(f)) for OLS (all variables); right: observed difference in prediction error (PE^bag − PE) vs. (Spread(f^bag) − Spread(f)) for subset selection (one variable).
Ringnorm was proposed by Breiman as a difficult task for decision trees, since the Bayes boundary is spherical. There are 20 features and 2 classes. Class 1 is multivariate normal with mean zero and covariance matrix four times the identity. Class 2 is multivariate normal with mean (a, a, ..., a), a = 1/√20, and identity covariance matrix. The training sets comprise 300 examples, prediction error is estimated on test sets of size 10000, and 250 trials are performed.

In our experiments, the mean prediction error of the bagged estimate is identical to that of the original estimate (40.4%, with an insignificant difference of 0.05% between the two means). Significant differences between the two estimates are however observed in 35% of experiments (at the 5% level), with a standard error of differences in prediction error of 1.6%. The results are more variable with the bagged predictor, with a standard deviation of prediction error of 1.4% versus 0.7% for unbagged stumps. There is no general agreement on what the bias/variance decomposition should be for classification problems (see [6] for a review in the binary case), and such a discussion is out of the scope of this paper. However, for all definitions retaining an additive decomposition where variance represents the variable part of prediction error, our results show that bagging increased variance.

As no classical leverage statistic exists in classification, the influence of example i is measured by the difference between the predictor f trained with the whole training sample and the predictor f_{-i} trained with the sample deprived of example i, T_{-i} = {x_j, y_j}_{j≠i}. The most relevant difference measure in classification is the expected disagreement in decisions, i.e. the probability Pr(f(X) ≠ f_{-i}(X)), which can be accurately estimated in simulated experiments. Fig. 3 displays the cumulative distribution of Pr(f(X) ≠ f_{-i}(X)) for stumps and bagged stumps.
Histograms provide a poor visualization, since the distribution for stumps has a step at zero and is furthermore heavy-tailed. For stumps, about 94% of examples have no influence on the predictor, in the sense that the classifier is not modified when one of them is deleted. The absence of the other examples causes important changes, with f_{-i} disagreeing
54
Yves Grandvalet
Fig. 3. Cumulative distribution function of Pr(f(X) ≠ f_{-i}(X)) (in %) for stumps (dashed) and bagged stumps (solid)
with f in up to 45% of test cases. Overall, the mean disagreement is 1.7%, with a standard deviation of 7.3%¹. For bagged stumps, only 37% of examples have no influence on the predictor, and f_{-i} disagrees with f in at most 16% of test cases. Here, the mean disagreement is reduced to 1.3%, and its standard deviation is only 1.6%. Bagging is thus again shown to equalize the influence of examples.

The plot of differences in prediction error versus the variance of normalized influence does not show any trend. Influential points are sometimes beneficial and sometimes detrimental, resulting in identical prediction errors for stumps and bagged stumps. This plot is not reproduced here for lack of space; instead, we illustrate how influence is distributed on the margin distribution. In this binary classification problem, coding labels and stump classifier outputs in {−1, 1}, the margin is defined as ρ(f^bag(x), y) = y f^bag(x) [11]. It represents the confidence with which the example is correctly classified (negative margins correspond to misclassification). The margin distribution is essential in the analysis of arcing or boosting algorithms [2,11]. In particular, AdaBoost is shown to shift the margin distribution towards positive values by placing most weight on examples with small margins [11].

Fig. 4 displays the mean influence of points according to their margins. For stumps, the most influential examples correspond to intermediate margins. Bagging results in a spread over the margin distribution, increasing the influence of hard and easy examples, while decreasing that given to intermediate cases. Unlike AdaBoost, it does not concentrate specifically on difficult examples. This behavior may explain why bagging is less effective than AdaBoost when there is little classification noise, and why it is better otherwise [5,10].
¹ Note that these important differences are not reflected by the prediction error of f_{-i}, as many different predictors can provide an error rate of about 40%.
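The influence measure Pr(f(X) ≠ f_{-i}(X)) used above can be estimated by retraining without example i and counting disagreements on test inputs. The sketch below is illustrative only: `fit_stump` is a simplified stand-in for the stumps used in the experiments, and the data are synthetic.

```python
import numpy as np

def fit_stump(X, y):
    """Best single-feature threshold classifier (a 'stump') by training error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > t, sign, -sign)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda Z, j=j, t=t, sign=sign: np.where(Z[:, j] > t, sign, -sign)

def influence(X, y, X_test, i):
    """Estimate Pr(f(X) != f_{-i}(X)): disagreement on test inputs between
    the predictor trained on all data and the one trained without example i."""
    f = fit_stump(X, y)
    mask = np.arange(len(y)) != i
    f_minus_i = fit_stump(X[mask], y[mask])
    return np.mean(f(X_test) != f_minus_i(X_test))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.where(X[:, 0] > 0, 1, -1)
X_test = rng.normal(size=(200, 3))
print(influence(X, y, X_test, i=0))  # a value in [0, 1]
```

An influence of zero means deleting example i leaves the fitted classifier's decisions unchanged, which is the situation reported above for about 94% of examples with stumps.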
Fig. 4. Mean disagreement Pr(f(X) ≠ f_{-i}(X)) (in %) according to the margin of training examples for stumps (dashed) and bagged stumps (solid)
5 Discussion
In bagging, each example appears, on average, the same number of times over all bootstrap samples, so it may seem paradoxical that its weight is modified. Our experiments however illustrate that bagging equalizes the influence of examples in setting a predictor. Fewer points have a small influence, while the highly influential ones are down-weighted. These observations support one of Breiman's early statements: the vital element for gaining accuracy thanks to bagging is the instability of the prediction method [1]. However, stability is no longer related to variance, but to the presence of influential examples in the dataset for a given prediction method.

In many situations, highly influential points are outliers, and their down-weighting reduces the variance of the estimator. If an "unstable predictor" is characterized by severe leverage points, then our analysis is in accordance with Breiman's: bagging stabilizes estimation and reduces variance. But the present explanation, which is not rooted in the averaging argument, also applies when bagging fails: the stabilization effect of bagging is harmful when estimation accuracy benefits from influential points (the so-called good leverage points).

Regarding margins, the classification experiment showed that bagging does not concentrate specifically on difficult examples, but spreads the influence of examples over the margin distribution. This behavior may explain why bagging is much less effective than AdaBoost when difficult examples indicate class boundaries, and why it is more robust to classification noise.

This experimental study also illustrated the limits of the asymptotic analyses stating that a linear estimate is not affected by bagging in the limit of infinite datasets [7,3]. Bagging was shown to modify the ordinary least squares (linear) predictor for a finite sample size. On the other hand, the up-weighting of points with little influence may be responsible for the smoothing effect which is observed at finite sample sizes and persists asymptotically.
With the influence-equalization viewpoint, bagging can be interpreted as a perturbation technique aimed at improving robustness against outliers. Indeed, averaging predictors trained on perturbed training samples is a means to favor invariance to these perturbations. Bagging applies this heuristic to case-weight perturbations. Bootstrap sampling, which is a central element of bagging according to the variance reduction argument, is only one possibility among others to provide these perturbations. The effectiveness of other resample-and-combine schemes involving subsampling [3,7] can be understood within this perspective.
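The case-weight perturbation view can be made concrete with a minimal bagging sketch (illustrative; `bag_predictor` and `ols` are our names, not the paper's):

```python
import numpy as np

def bag_predictor(fit, X, y, n_boot=50, rng=None):
    """Bagging: average the predictors fit on bootstrap resamples.
    `fit(X, y)` must return a function mapping inputs to predictions."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y)
    members = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # bootstrap = case-weight perturbation
        members.append(fit(X[idx], y[idx]))
    return lambda Z: np.mean([m(Z) for m in members], axis=0)

def ols(X, y):
    """Ordinary least squares as the base learner."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda Z: Z @ w

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=30)
f_bag = bag_predictor(ols, X, y, rng=rng)
print(f_bag(X[:3]))  # averaged predictions of 50 bootstrap OLS fits
```

Resampling with replacement is just one way to perturb case weights; subsampling without replacement, as in the resample-and-combine schemes cited above, fits the same template by changing how `idx` is drawn.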
References

1. L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
2. L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Statistics Department, University of California at Berkeley, 1997.
3. P. Bühlmann and B. Yu. Explaining bagging. Technical Report 92, Seminar für Statistik, ETH Zürich, 2000.
4. A. N. Burgess. Estimating equivalent kernels for neural networks: A data perturbation approach. In M.C. Mozer, M.I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 382–388. MIT Press, 1997.
5. T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 40(2):1–19, 2000.
6. J. H. Friedman. On bias, variance, 0/1 loss, and the curse of dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77, 1997.
7. J. H. Friedman and P. Hall. On bagging and nonlinear estimation. Technical report, Stanford University, Stanford, CA, January 2000.
8. Y. Grandvalet. Bagging down-weights leverage points. In S.I. Amari, C. Lee Giles, M. Gori, and V. Piuri, editors, IJCNN, volume IV, pages 505–510. IEEE, 2000.
9. T. J. Hastie and R. J. Tibshirani. Generalized Additive Models, volume 43 of Monographs on Statistics and Applied Probability. Chapman & Hall, 1990.
10. R. Maclin and D. Opitz. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 546–551. AAAI Press, 1997.
11. R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
12. R. J. Tibshirani and K. Knight. The covariance inflation criterion for adaptive model selection. Journal of the Royal Statistical Society, B, 61(3):529–546, 1999.
Symbolic Prosody Modeling by Causal Retrocausal NNs with Variable Context Length

Achim F. Müller and Hans Georg Zimmermann

Siemens Corporate Technology, Otto-Hahn-Ring 6, D-81739 Munich, Germany
{achim.mueller,georg.zimmermann}@mchp.siemens.de
Abstract. In this paper the application of causal retrocausal neural networks (NNs) to accent label prediction for speech synthesis is presented. Within the proposed NN architecture, gating clusters are applied, enabling the dynamic adaptation of the network structure depending on the actual input to the NN. In the proposed causal retrocausal NN, gating clusters are used to adapt the network structure such that the network has a variable context length. This way only the available input feature vectors from the actual context window are treated. The proposed NN architecture has been successfully applied to accent label prediction within our text-to-speech (TTS) system. Prediction accuracy is about 83%. This result is higher than results achieved with tree-based (CART) methods on a corpus of similar complexity.
1 Introduction
In many TTS systems the complex task of f0 generation is split into two parts [3]. In the first part, symbolic prosody labels are generated. These labels are used as inputs to the second part, which generates f0 contours. Symbolic prosody labels can be separated into two types: phrase break labels and accent labels. In our system, phrase break labels are generated for a whole sentence prior to accent labels, using the method presented in [6]. In this paper the data-driven prediction of accent labels is discussed. As input parameters, part-of-speech (POS) tags and phrase break information are used.

For fast and easy adaptation to new languages and/or speakers, a data-driven approach is favorable. For symbolic accent label prediction, prominent data-driven approaches are based on classification and regression trees (CARTs) [7]. In [8] feed-forward NNs and tree-based learning methods have been used to predict word and syllable prominence. The mentioned approaches are based on previously fixed tree or NN structures. In our approach, structures can be adapted depending on the actual input to the NN.

Our system works on sentence level, i.e. our goal is to generate a neutral prosody for each sentence as if the sentence were read isolated from surrounding sentences. Further, we determine context windows. The context windows are shifted across each sentence. A problem occurs at sentence boundaries: at sentence beginning no left context is available and at sentence end no right

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 57–64, 2001.
© Springer-Verlag Berlin Heidelberg 2001
58
Achim F. M¨ uller and Hans Georg Zimmermann
context is available. By applying gating clusters within the presented NN architecture, a dynamic adaptation of the network structure is proposed to handle the variable context information within the context windows. The architecture is embedded in a causal retrocausal NN as presented in [10]. In this framework, time interdependence of variables can be captured. This way the complex dynamical process of accent label prediction can possibly be modeled more accurately than with tree-based methods or simple feed-forward NNs.

The paper is organized as follows: In section 2 the applied causal retrocausal NN structure is explained, and the problem encountered at sentence boundaries when applying this structure is described. In section 3 a solution to this problem is presented by applying gating clusters that can be used for dynamic network adaptation. In section 4 a method is presented to find an optimal context window length in a data-driven way, and results are discussed. Finally, in section 5 a conclusion of the presented work is given.
Fig. 1. Block diagram of a causal retrocausal NN.
2 Causal and Retrocausal NN
Figure 1 shows the block diagram of a causal retrocausal NN [10]. As can be seen, the system is built up by connecting subnetworks NN_i, i ∈ {l, ..., r}, l < 0, r > 0. Each subnetwork is associated with one time step. The connection of these subnetworks is realized by shared weight matrices A and A′ (shared weights means using the same set of weights in each connector denoted by the same upper-case boldface letter). A propagates the causal information flow (from past to future) and A′ propagates the retrocausal information flow (from future to past). Within the architecture of figure 1, shared weight matrices allow a symmetric handling of past and future information. In our application, past refers to left context and future refers to right context.

As can be seen in figure 1, each NN_i has a vector x_ti as input and a vector y_ti as output, with x_ti ∈ R^m and y_ti ∈ R^n for all i, i.e. the input and output vectors for each subnetwork NN_i are of the same dimension. The input to the whole system is given by a sequence of input vectors x_tl, ..., x_t0, ..., x_tr and the output is given by a sequence of output vectors y_tl, ..., y_t0, ..., y_tr.
Symbolic Prosody Modeling by Causal Retrocausal NNs
59
In our application, the input vectors x_tl, ..., x_t0, ..., x_tr of figure 1 represent a sequence of feature vectors. Feature vectors are calculated on word level, i.e. each time step t_i in figure 1 is associated with one word. A feature vector contains the information mentioned in section 1, i.e. POS information and phrase break information. Output vectors are also determined on word level and contain the accent information for the word. Both input and output information are of nominal structure and are therefore coded with 1-out-of-k codes.

As mentioned in section 1, our system works on sentence level, i.e. no sentence interdependencies are modeled. Therefore, a problem occurs at sentence boundaries: At the beginning (end) of a sentence no left (right) context is available, i.e. no defined input can be determined for the first (last) |l| (r) feature vectors x_ti and the first (last) |l| (r) output vectors y_ti of the input/output sequence. The broader the left and right context, i.e. the greater |l| and r, the more feature vectors x_ti and output vectors y_ti are not defined. If the used network structure is static, i.e. the structure cannot be adapted dynamically at runtime, a well-defined input signal must be determined for each input neuron. In figure 1 this means an input vector x_ti must be determined for each subnetwork NN_i. This problem can be overcome by dynamic adaptation of the network structure, as presented in the next section.
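The 1-out-of-k coding of the nominal features can be sketched as follows (the tag set shown is hypothetical; the real feature vectors also encode phrase-break information):

```python
import numpy as np

def one_out_of_k(labels, vocabulary):
    """1-out-of-k coding: each nominal label becomes a unit vector."""
    index = {v: i for i, v in enumerate(vocabulary)}
    codes = np.zeros((len(labels), len(vocabulary)))
    for row, lab in enumerate(labels):
        codes[row, index[lab]] = 1.0
    return codes

# Hypothetical POS tag set, one code per word in the context window.
pos_tags = ["NOUN", "VERB", "ADJ", "DET"]
codes = one_out_of_k(["DET", "NOUN", "VERB"], pos_tags)
print(codes)
```

Each row has exactly one nonzero entry, so nominal categories enter the network without any spurious ordering between them.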
3 Dynamic Adaptation of Causal Retrocausal NN
In the first part of this section, gating clusters are presented, which can in principle be applied within any NN architecture for dynamic adaptation. In the second part it is explained how gating clusters are applied within our causal retrocausal NN.

3.1 Gating Clusters
When applying NNs it is generally desirable to design a problem-adapted network structure, i.e. to incorporate prior information into the neural network design [4]. The better this adaptation, the better a process can possibly be modeled. Adaptation can e.g. mean optimization of the number of hidden layers or different clustering of neurons. These adaptations are all static, i.e. they are fixed prior to runtime of the system. Gating clusters enable an adaptation of the network structure at runtime, i.e. a dynamic adaptation. Figure 2 illustrates how gating clusters can be applied¹. Gating clusters are applied at the output of a hidden cluster. On the left-hand side of figure 2, a cut-out hidden cluster of a NN without a gating cluster is displayed. The cluster is connected with the rest of the network (denoted by NN) by the weight matrices
¹ Remark on notation of NNs in this paper: ellipses denote a cluster of neurons and lower-case boldface letters denote the output vector of such a cluster of neurons. The dimension of an output vector is thus equal to the number of hidden neurons in the cluster. Upper-case boldface letters denote weight matrices.
Fig. 2. Left: no gating cluster is used. Right: the gating cluster controls the output signal of the hidden cluster x_hid.
W1 and W2. The cluster contains h neurons, and the overall output from all neurons is given by the vector x_hid ∈ R^h. On the right-hand side of figure 2, the same hidden cluster associated with the output signal vector x_hid as on the left-hand side of the figure can be found. The input to the hidden cluster did not change. The output signal vector x_hid, however, is now propagated to a gating cluster via the identity matrix id ∈ R^{h×h}. The gating cluster is controlled by a gating vector x_g ∈ R^h. The transfer function g(x_g, x_hid) (denoted by g(·) in figure 2) of the gating cluster is defined by

$$ g\left( \begin{bmatrix} x_{g,1} \\ \vdots \\ x_{g,h} \end{bmatrix}, \begin{bmatrix} x_{hid,1} \\ \vdots \\ x_{hid,h} \end{bmatrix} \right) = \begin{bmatrix} x_{g,1} \cdot x_{hid,1} \\ \vdots \\ x_{g,h} \cdot x_{hid,h} \end{bmatrix} . \qquad (1) $$
The output of the gating cluster, given by g(x_g, x_hid), is propagated to the rest of the network via the same weight matrix W2 as on the left-hand side of figure 2. If x_{g,i} ∈ {0, 1}, i = 1, ..., h, then the gating cluster can be used to gate (fade out) the output of neurons of the hidden cluster. If x_g = f(x) (as indicated in figure 2), i.e. x_g is a function of the input vector x to the NN, then gating clusters can be used as an elegant way to adapt the network structure dynamically. If whole clusters of neurons need to be faded out for some x, f(x) must be chosen such that x_g = [0, ..., 0]^T. For x_g = [0, ..., 0]^T no signal of x_hid is propagated to the rest of the network. Further, during learning, weight adaptation in W1 and W2 is prevented, since no error signal can pass the gating cluster. Thus, the network behaves as if the cluster x_hid did not exist. For x_g = [1, ..., 1]^T the gating cluster has no effect, i.e. the network can be simplified as displayed on the left-hand side of figure 2.
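Equation (1) is a component-wise product, so the gating mechanism can be sketched in a few lines (illustrative NumPy, not the authors' implementation):

```python
import numpy as np

def gate(x_g, x_hid):
    """Transfer function of a gating cluster, Eq. (1): component-wise product."""
    return x_g * x_hid

x_hid = np.array([0.3, -0.7, 0.5])
print(gate(np.ones(3), x_hid))    # gate fully open: output unchanged
print(gate(np.zeros(3), x_hid))   # gate closed: cluster faded out
# Since d gate / d x_hid = x_g, a closed gate (x_g = 0) also yields a zero
# gradient, so no error signal reaches W1, exactly as described above.
```

The same multiplicative mask thus switches off both the forward signal and the backward error flow, which is what makes the structural adaptation consistent during training.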
Fig. 3. Subnetwork NNi using a gating cluster.
3.2 Embedding in Causal Retrocausal NN
The problem encountered at sentence boundaries (cf. section 2) can be overcome by dynamic adaptation of the network structure depending on the sequence length of defined input vectors x_ti, i.e. by shortening the sequence of NN_i when necessary. This can be achieved by choosing subnetworks NN_i using gating clusters, as displayed in figure 3. For a defined x_ti, we choose

$$ f(x_{t_i}) = x_{g,t_i} = [1, \dots, 1]^T . \qquad (2) $$

As a consequence, y_ti = z̃_ti = z_ti. This means an output of NN_i is generated, and during training the error signal is backpropagated to NN_i. This in turn leads to weight adaptation within NN_i and the rest of the network (as can be seen, the error signal is propagated to the rest of the network via C → A and C′ → A′, respectively). The rest of the network means all NN_j with j ≠ i.

For an undefined x_ti we let

$$ f(x_{t_i}) = x_{g,t_i} = 0 \quad \text{and} \quad x_{t_i} = 0 . \qquad (3) $$

This way there are no signals from NN_i to the rest of the network: The gating cluster prevents the backpropagation of an error signal. Therefore, weight adaptation according to an invalid error signal is prevented. Further, by choosing x_ti = 0, no signals are propagated to the rest of the network via B → s_ti → A and B′ → s′_ti → A′, respectively.
62
Achim F. M¨ uller and Hans Georg Zimmermann
The state transition equations for the complete NN architecture (cf. figures 1, 3) are given by:

$$ s_{t_i} = \tanh(A\, s_{t_{i-1}} + B\, x_{t_i}) \qquad (4) $$

$$ s'_{t_i} = \tanh(A'\, s'_{t_{i+1}} + B'\, x_{t_i}) \qquad (5) $$

$$ y_{t_i} = C\, s_{t_i} + C'\, s'_{t_i} \qquad (6) $$
4 Experiments and Results
The accent labeling scheme used to label our training and testing corpus [2] for the German language is the same as in [1], i.e. perceptual-acoustic labels are assigned to each word by listening only. The corpus contains 1000 sentences (21549 words) taken from the German newspaper Frankfurter Allgemeine Zeitung. As in [7], a professional radio news speaker was chosen for the recordings, since news reading is our target domain. For our experiments, we adopt the strategy of [1] and [5] and only distinguish between words perceived as accented and words perceived as deaccented.

4.1 Optimization of the Number of Time Steps
When applying the presented NN architecture, an important choice is the size of the context window, i.e. the size of l and r in figure 1. l determines the maximum number of past time steps (left context) and r determines the maximum number of future time steps (right context). In the following, a method is presented to determine l and r in a data-driven way using the proposed NN architecture (figures 2 and 3).

In [9] the following method is presented to detect how many past time steps contribute information to a causal NN (i.e. no A′ connections in figure 1): For l past time steps, the error is observed for each output signal y_tl, y_{tl+1}, ..., y_t0. If e_ti denotes the error value for y_ti, then one should be able to observe a declining error from past to present time steps, i.e. e_tl > e_{tl+1} > .... This has the following reason: y_tl is computed with only x_tl as input. For the next output y_{tl+1}, information from x_{tl+1} and additionally from x_tl is available (information from x_tl is transmitted via A, see figure 1). This superposition of more and more information should lead to e_tl > e_{tl+1} > .... If this is not the case for all time steps, i.e. e_tj ≤ e_{tj+1} at some time step t_j, j ∈ {l, ..., −1}, then input x_{tj+1} did not contribute relevant information to the model and the maximal intertemporal connectivity (MIC) [9] is reached, i.e. inputs longer ago in time do not add information and can be omitted.

In our application, the concept of [9] is applied to find the MIC for past inputs (left context). Further, the concept is extended symmetrically to analyze the MIC for future inputs (right context). Figure 4 displays the results of the MIC analysis for left and right context. The two plots of figure 4 result from independent experiments.
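The MIC criterion amounts to locating the offset at which the per-time-step error stops declining. A sketch with illustrative error values (not the paper's measurements):

```python
def mic_offset(errors, offsets):
    """Maximal intertemporal connectivity: the offset where the per-time-step
    error e_ti stops declining (its minimum); inputs further away in time add
    no information and can be omitted."""
    best = min(range(len(errors)), key=lambda k: errors[k])
    return offsets[best]

# Illustrative error values shaped like the left panel of Fig. 4:
# declining for -6 <= i <= -3 and rising for i > -3  =>  MIC at i = -3.
errors = [0.470, 0.455, 0.445, 0.430, 0.442, 0.450, 0.462]
offsets = [-6, -5, -4, -3, -2, -1, 0]
print(mic_offset(errors, offsets))  # -3, i.e. a left context of three
```

Running the same analysis with the error sequence reversed (future to past) gives the right-context MIC, mirroring the symmetric extension described above.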
Fig. 4. MIC analysis for left and right context.
In the first experiment (related to the left-hand side of figure 4) a causal NN using gating clusters is used to determine the MIC for the left context. For the experiment a temporal context of six was used. In figure 1 this means no A′ connections are used, and l = −6 and r = 0. The plot on the left-hand side of figure 4 shows the error values e_ti of the output vectors y_ti. As can be seen, the error declines for −6 ≤ i ≤ −3 and rises for i > −3.

In the second experiment (related to the right-hand side of figure 4) a retrocausal NN using gating clusters is used to determine the MIC for the right context. For the experiment a temporal context of six time steps was used. In figure 1 this means no A connections are used, and l = 0 and r = 6. The plot on the right-hand side of figure 4 shows the error values e_ti of the outputs y_ti. As can be seen, the error declines (now from future to past time steps, i.e. from right to left) for 2 ≤ i ≤ 6 and rises for i < 2. The MIC is thus reached for a right context of four. This demonstrates that MIC analysis can be extended symmetrically to future context analysis, as proposed.

The two experiments are used to determine the optimal size of a context window. The context window is chosen such that the two time steps where the MIC was determined join at t0. Thus, the lowest error and highest prediction accuracy are found at t0. From the experiments related to figure 4, this is achieved by choosing a left context of three time steps and a right context of four time steps. In figure 1 this means l = −3 and r = 4.

4.2 Prediction Accuracy
For training and testing, the above-described corpus (German language) and the corpus used in [7] (English language) were used. Both corpora have been separated into three subsets which contain approximately the following percentages of data: training set 70%, validation set 10%, and test set (independent testing data) 20%. All results reported below are determined on the test set. The validation set is used to avoid overfitting to the training data, i.e. training is stopped when a rising error is observed on the validation set (early stopping).

Prediction accuracy for the proposed NN architecture (figures 1, 3) was measured at 84.5% for the English corpus used in [7]. In [7] prediction accuracy
for accents on word level obtained with a CART-based approach is reported at 82.5%. The comparison of the results shows an improvement for the proposed method. This improvement can be seen as significant because of the large test set.

In a second experiment, the above-described German corpus was used. Prediction accuracy for this corpus was measured at 83.1%. The German corpus has roughly twice the size of that used in [7]. Thus the test set used also has twice the size of that used in [7] and may contain more complex sentence structures. Therefore, the high prediction accuracy for this large test set demonstrates the good generalization ability of the proposed method.
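The early-stopping procedure described above can be sketched generically (illustrative; `train_step` and `validation_error` are placeholders for the actual training and evaluation routines):

```python
def train_with_early_stopping(train_step, validation_error, max_epochs=100,
                              patience=1):
    """Early stopping: train until the validation error rises, then keep the
    best parameters seen. `train_step(epoch)` returns candidate parameters;
    `validation_error(params)` scores them on the held-out validation set."""
    best_params, best_err, bad_epochs = None, float("inf"), 0
    for epoch in range(max_epochs):
        params = train_step(epoch)
        err = validation_error(params)
        if err < best_err:
            best_params, best_err, bad_epochs = params, err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break   # rising validation error: stop to avoid overfitting
    return best_params, best_err

# Toy run: validation error falls, then rises after "epoch" 4.
errs = [0.50, 0.45, 0.42, 0.40, 0.39, 0.41, 0.44]
params, err = train_with_early_stopping(lambda e: e, lambda p: errs[p],
                                        max_epochs=len(errs))
print(params, err)  # stops at the minimum: epoch 4, error 0.39
```

The test set is never consulted during this loop, which keeps the reported 84.5% and 83.1% accuracies honest estimates on independent data.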
5 Conclusion
In this paper a new NN architecture for accent label prediction was presented. The architecture applies gating clusters in a causal retrocausal NN. With gating clusters the number of time steps of a causal retrocausal NN can be adapted dynamically. It was shown how the architecture can be used to determine an optimal size for a context window. With the optimized context window size a prediction accuracy of 84.5% was achieved for an English corpus and a prediction accuracy of 83.1% for a German corpus. This demonstrates the good performance of the proposed method.
References

1. A. Batliner, M. Nutt, V. Warnke, E. Nöth, J. Buckow, R. Huber, and H. Niemann. Automatic annotation and classification of phrase accents in spontaneous speech. In Eurospeech, 1999.
2. Institut für Phonetik und sprachliche Kommunikation. Siemens Synthese Korpus SI1000P. Corpus available at http://www.phonetik.uni-muenchen.de/Bas/.
3. Ralf Haury and Martin Holzapfel. Optimization of a neural network for speaker and task dependent f0 generation. In ICASSP, 1998.
4. Simon Haykin. Neural Networks — A Comprehensive Foundation, chapter 1.7 — Knowledge Representation. Prentice Hall International, 1999.
5. Julia Hirschberg. Pitch accent in context: Predicting prominence from text. Artificial Intelligence, 63:305–340, 1993.
6. Achim F. Müller, Hans G. Zimmermann, and R. Neuneier. Robust generation of symbolic prosody by a neural classifier based on autoassociators. In ICASSP, 2000.
7. K. Ross and M. Ostendorf. Prediction of abstract prosodic labels for speech synthesis. Computer Speech and Language, 10:155–185, 1996.
8. Christina Widera, Thomas Portele, and Maria Wolters. Prediction of word prominence. In Eurospeech, 1997.
9. Hans G. Zimmermann, R. Neuneier, and R. Grothmann. Modeling and Forecasting Financial Data, Techniques of Nonlinear Dynamics, chapter Modeling of Dynamic Systems by Error Correction Neural Networks. Kluwer Academic, 2000.
10. Hans Georg Zimmermann, Achim F. Müller, Çağlayan Erdem, and Rüdiger Hoffmann. Prosody generation by causal retrocausal error correction neural networks. In Workshop on Multi-Lingual Speech Communication, Advanced Telecommunications Research Institute International (ATR), 2000.
Discriminative Dimensionality Reduction Based on Generalized LVQ

Atsushi Sato

Multimedia Research Labs., NEC Corporation
1-1, Miyazaki 4-chome, Miyamae-ku, Kawasaki, Kanagawa 216-8555, Japan
[email protected]

Abstract. In this paper, a method for dimensionality reduction, based on generalized learning vector quantization (GLVQ), is applied to handwritten digit recognition. GLVQ is a general framework for classifier design based on the minimum classification error criterion, and it is easy to apply to dimensionality reduction in feature extraction. Experimental results reveal that the training of both a feature transformation matrix and reference vectors by GLVQ is superior to that by principal component analysis in terms of dimensionality reduction.
1 Introduction
One of the important roles of feature extraction is to map the data into a lower-dimensional space. In general, such a reduction in dimensionality will be accompanied by a loss of some of the information which discriminates between different classes. However, feature extraction leads to improved performance, because we can reduce noise and the curse of dimensionality as well as incorporate prior knowledge. The goal in dimensionality reduction is therefore to preserve as much of the information relevant to classification as possible.

Principal component analysis (PCA), in which a data set in d-dimensional space is assumed to have an intrinsic dimensionality q < d, is often used for dimensionality reduction [1]. Fisher discriminant analysis (FDA), in which the ratio of between-class scatter to within-class scatter is maximized, can also be applied to dimensionality reduction [1]. It is noteworthy that we are unable to find more than K − 1 bases in FDA, where K is the number of classes, because of the rank deficiency of the between-class scatter matrix. FDA is thus not applicable to dimensionality reduction in the case of a small number of classes, as in digit recognition. In recent years, independent component analysis (ICA), in which the data are projected onto non-orthogonal independent bases, has been investigated [2,3]. However, finding independent bases in a high-dimensional space is still a hard problem, so we have to reduce dimensionality with PCA or FDA before using ICA.

The common problem with the above methods is that the optimality of classification is not guaranteed, because the kinds of classifiers that are used are not

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 65–72, 2001.
© Springer-Verlag Berlin Heidelberg 2001
66
Atsushi Sato
taken into account. Discriminative feature extraction (DFE) is a framework for solving this problem [4]. In DFE, the feature extraction module can be trained, based on the minimum classification error (MCE) criterion [5], for a given classifier. The author has proposed a new form of discriminative training based on MCE, called generalized learning vector quantization (GLVQ) [6,7,8]. In recent years, GLVQ's effectiveness has been evaluated in character recognition [9,10]. GLVQ can be applied to feature extraction as well as to classification, and the effectiveness of combining FDA and GLVQ has been evaluated in Chinese character recognition [11].

In this paper, the combination of PCA and GLVQ is applied to the recognition of handwritten digits, and a feature transformation matrix is trained as well as reference vectors on the basis of GLVQ. Experimental results for digit recognition reveal that, in terms of dimensionality reduction, the training of both the feature transformation matrix and the reference vectors by GLVQ is superior to that by PCA. A detailed introduction to MCE and GLVQ is provided in Sec. 2, and dimensionality reduction on the basis of GLVQ is presented in Sec. 3. Experimental results for digit recognition and a discussion are given in Sec. 4. Finally, some conclusions are given in Sec. 5.
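The PCA baseline discussed above can be sketched with an SVD of the centered data (illustrative NumPy; in the approach studied here, such a transformation matrix is subsequently trained by GLVQ rather than kept fixed):

```python
import numpy as np

def pca_reduce(X, q):
    """Project d-dimensional data onto its first q principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:q].T                            # d x q transformation matrix
    return Xc @ W, W

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
Z, W = pca_reduce(X, q=3)
print(Z.shape)  # (100, 3)
```

Note that this projection maximizes retained variance, not class separability, which is exactly the limitation the discriminative training in the following sections is meant to address.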
2 Minimum Classification Error
MCE is a criterion for discriminative training that minimizes the overall risk of Bayes decision theory by a gradient descent procedure [12,5], and it can be applied to arbitrary classifier structures. In this section, a brief introduction to Bayes decision theory is provided, and we then explain the formulation of classifier design on the basis of MCE.

2.1 Bayes Decision Theory
According to Bayes decision theory, the classifier should be designed so that it minimizes the overall risk (or the expected loss) [13], which is defined by

R = \sum_{k=1}^{K} \int \ell(\alpha(x) \mid \omega_k) \, P(\omega_k \mid x) \, p(x) \, dx,   (1)

where K denotes the number of classes, p(x) is the probability density of an input vector x, P(\omega_k \mid x) is the a posteriori probability, and \ell(\alpha(x) \mid \omega_k) is the loss incurred by the decision \alpha(x) for a given class \omega_k. If \ell(\alpha(x) \mid \omega_k) is a zero-one function such that \ell(\omega_j \mid \omega_k) = 1 - \delta_{jk}, the expected loss can be written as

R = 1 - \int P(\omega_j \mid x) \, p(x) \, dx.   (2)

If we adopt the MAP rule, \alpha(x) = \omega_j if P(\omega_j \mid x) = \max_k P(\omega_k \mid x), the second term on the right of Eq. (2) is maximized, and the expected loss is therefore minimized.

Discriminative Dimensionality Reduction Based on Generalized LVQ   67

Since the a posteriori probability can be written, according to Bayes' formula, as

P(\omega_k \mid x) = \frac{p(x \mid \omega_k) P(\omega_k)}{p(x)},   (3)

the estimation of the class-conditional probability density p(x \mid \omega_k) is one of the central problems of statistical pattern recognition.

2.2 Generalized Probabilistic Descent
Since the estimation of probability densities is difficult because of the curse of dimensionality, a good way of designing the classifier is to minimize the expected loss directly, without estimating densities. If we assume that p(x) = \frac{1}{N} \sum_{n} \delta(x - x_n) and P(\omega_k \mid x) = 1(x \in \omega_k), the following empirical loss can be obtained:

R_e = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \ell(\alpha(x_n) \mid \omega_k) \, 1(x_n \in \omega_k),   (4)

where N is the number of training samples, and 1(\cdot) is an indicator function such that 1(true) = 1 and 1(false) = 0. Minimum classification error (MCE) training minimizes this empirical loss directly by a gradient search procedure. To make the empirical loss differentiable, a continuous loss function is employed instead of the zero-one function: \ell(\alpha(x) \mid \omega_k) = \ell(\rho_k(x; \theta)), where \rho_k(x; \theta) is called the misclassification measure and \theta denotes the classifier's parameters. In the formulation called generalized probabilistic descent (GPD) [5], the loss function \ell(\cdot) is defined by

\ell(\rho) = \frac{1}{1 + e^{-\xi \rho}},   (5)

where \xi is a positive number. For the misclassification measure, the following differentiable form is employed:

\rho_k(x; \theta) = -g_k(x; \theta) + \left( \frac{1}{K-1} \sum_{i \neq k} g_i(x; \theta)^{\eta} \right)^{1/\eta},   (6)

where \eta is a positive number and g_k(x; \theta) denotes the discriminant function of class \omega_k. When there are two classes or when \eta \to \infty, Eq. (6) reduces to \rho_k(x; \theta) = -g_k(x; \theta) + g_l(x; \theta), where g_l(x; \theta) = \max_{i \neq k} g_i(x; \theta). Obviously, \rho_k(x; \theta) > 0 implies misclassification and \rho_k(x; \theta) < 0 a correct decision for a given x \in \omega_k.
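As a concrete illustration, the GPD quantities above can be evaluated directly. The snippet below is a minimal sketch (not from the paper): it computes the sigmoid loss of Eq. (5) and the misclassification measure of Eq. (6) for a toy set of discriminant values; the function names and the choices ξ = 1 and η = 4 are illustrative assumptions.

```python
import numpy as np

def gpd_loss(rho, xi=1.0):
    """Sigmoid loss of Eq. (5): a smooth surrogate for the zero-one loss."""
    return 1.0 / (1.0 + np.exp(-xi * rho))

def misclassification_measure(g, k, eta=4.0):
    """Eq. (6): rho_k = -g_k + ((1/(K-1)) * sum_{i != k} g_i^eta)^(1/eta).

    g   -- discriminant values g_i(x; theta), one per class (positive here)
    k   -- index of the true class
    eta -- positive number; eta -> infinity recovers max_{i != k} g_i
    """
    g = np.asarray(g, dtype=float)
    others = np.delete(g, k)          # all g_i with i != k
    return -g[k] + (np.mean(others ** eta)) ** (1.0 / eta)

# rho > 0 signals misclassification, rho < 0 a correct decision:
g = [0.9, 0.4, 0.1]                  # class 0 has the largest discriminant
print(misclassification_measure(g, k=0) < 0)   # True (correct decision)
print(misclassification_measure(g, k=2) > 0)   # True (misclassified)
```

Note that for a minimum-distance classifier the discriminant values are negative (g = −d), so this toy uses generic positive scores purely for illustration.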
2.3 Generalized Learning Vector Quantization
If we apply GPD to minimum-distance classifiers, the same update rule for reference vectors as is used in learning vector quantization (LVQ) can be obtained.
However, this approach has a drawback: the reference vectors diverge, in the same way as in LVQ2.1. This problem arises because the solution that minimizes R_e lies at infinity. To solve this problem, the following misclassification measure was introduced in generalized learning vector quantization (GLVQ) [6,7,8]:

\rho_k(x; \theta) = \frac{-g_k(x; \theta) + g_l(x; \theta)}{g_k(x; \theta) + g_l(x; \theta)}.   (7)

Moreover, \xi in Eq. (5) is increased with time to approximate the zero-one loss function, because the minimization of R_e is not equivalent to the minimization of the average probability of error as long as \xi < \infty. The empirical loss can be minimized directly by a gradient descent procedure:

\theta(t+1) = \theta(t) - \varepsilon(t) \left. \frac{\partial R_e}{\partial \theta} \right|_{\theta = \theta(t)},   (8)

where \varepsilon(t) is a small positive number depending on time t. According to probabilistic descent, the derivative of R_e is given by
\frac{\partial R_e}{\partial \theta} = U \sum_{k=1}^{K} \frac{\partial \ell(\rho_k(x_n; \theta))}{\partial \theta} \, 1(x_n \in \omega_k).   (9)
If U is a positive-definite symmetric matrix, that is, x^T U x > 0 and U^T = U, it is guaranteed that the expectation of R_e decreases [12]. Moreover, if \varepsilon(t) satisfies the following conditions, the Robbins-Monro method ensures that \theta converges to the solution that minimizes R_e:

\lim_{t \to \infty} \varepsilon(t) = 0, \qquad \sum_{t=1}^{\infty} \varepsilon(t) = \infty, \qquad \sum_{t=1}^{\infty} \varepsilon(t)^2 < \infty.   (10)

3 Dimensionality Reduction

3.1 Principal Component Analysis
Principal component analysis (PCA) is commonly used for dimensionality reduction. The input vector x \in R^d is transformed into y \in R^q (q < d) as

y = A(x - \mu),   (11)

where \mu is the mean of all training samples, \mu = \frac{1}{N} \sum_{n=1}^{N} x_n. The q \times d transformation matrix A is defined by A = (\phi_1, \cdots, \phi_q)^T, in which \phi_i is the eigenvector corresponding to the eigenvalue \lambda_i (\lambda_i > \lambda_{i+1}) of the covariance matrix

\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T.   (12)
Since there will be just a few large eigenvalues, q is regarded as the intrinsic dimensionality of the subspace that governs the data.
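The PCA transform of Eqs. (11)-(12) can be sketched in a few lines of NumPy. The following is an illustrative implementation (not code from the paper); the function name and the synthetic data are assumptions made for demonstration.

```python
import numpy as np

def pca_transform(X, q):
    """Project the d-dimensional rows of X onto the top-q principal axes,
    following Eqs. (11)-(12)."""
    mu = X.mean(axis=0)                       # sample mean
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)                # covariance matrix, Eq. (12)
    lam, Phi = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    A = Phi[:, ::-1][:, :q].T                 # q x d matrix of top eigenvectors
    return Xc @ A.T, A, mu                    # y = A(x - mu), Eq. (11)

# toy data: 5-D points whose variance is concentrated in the first two axes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.1, 0.1])
Y, A, mu = pca_transform(X, q=2)
print(Y.shape)   # (200, 2)
```

The rows of A are orthonormal, and the variance of the projections decreases from the first component to the last, matching the ordering λ_i > λ_{i+1}.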
3.2 GLVQ Training
GLVQ can be applied to any classifier structure, but in this paper we apply it to the minimum-distance classifier. The discriminant function is given by

g_k(x; \theta) = -d_k(x; \theta) = -\| y - m_{ki} \|^2,   (13)
where m_{ki} is the i-th reference vector of class \omega_k, the one nearest to the transformed input vector y given by Eq. (11). The misclassification measure is then written as

\rho_k(x; \theta) = \frac{d_k(x; \theta) - d_l(x; \theta)}{d_k(x; \theta) + d_l(x; \theta)},   (14)

where d_k(x; \theta) = \| y - m_{ki} \|^2 and d_l(x; \theta) = \| y - m_{lj} \|^2. If we employ the unit matrix for U, Eq. (9) can be written for x \in \omega_k as follows:

\frac{\partial R_e}{\partial \theta} = \frac{\partial \ell(\rho_k(x; \theta))}{\partial \theta}.   (15)
Since the parameter \theta consists of the reference vectors and the transformation matrix, the following derivatives are obtained:

\frac{\partial R_e}{\partial m_{ki}} = -4 \, \ell'(\rho_k(x; \theta)) \, \frac{d_l(x; \theta)}{\{ d_k(x; \theta) + d_l(x; \theta) \}^2} \, (y - m_{ki}),   (16)

\frac{\partial R_e}{\partial m_{lj}} = +4 \, \ell'(\rho_k(x; \theta)) \, \frac{d_k(x; \theta)}{\{ d_k(x; \theta) + d_l(x; \theta) \}^2} \, (y - m_{lj}),   (17)

\frac{\partial R_e}{\partial m} = 0 \quad \text{for } m \neq m_{ki}, m_{lj},   (18)

\frac{\partial R_e}{\partial A} = -\left( \frac{\partial R_e}{\partial m_{ki}} + \frac{\partial R_e}{\partial m_{lj}} \right) (x - \mu)^T.   (19)
Note that the denominator \{ d_k(x; \theta) + d_l(x; \theta) \}^2 in Eqs. (16) and (17) should be replaced with d_k(x; \theta) + d_l(x; \theta) to be robust with respect to the initial values of the reference vectors [8]. Moreover, a constraint such as \sum_{i=1}^{q} \| \phi_i \|^2 = \text{const} should be employed to prevent divergence of A. The update of A can then be written as

\delta A = -\varepsilon(t) \frac{\partial R_e}{\partial A} = \{ -\alpha_{ki} (y - m_{ki}) + \alpha_{lj} (y - m_{lj}) \} (x - \mu)^T,   (20)

where \alpha_{ki} > 0 and \alpha_{lj} > 0. The above equation is easy to interpret if we apply it to (x - \mu) from the right: as shown in Fig. 1, the transformed vector y and the reference vector m_{ki}, which belong to the same class, approach each other, while y and m_{lj}, which belong to different classes, move apart.
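A single stochastic step of this joint update can be sketched as follows. This is an illustrative reconstruction, not the author's code; it uses the robust denominator d_k + d_l of [8], and a simple Frobenius-norm renormalization of A stands in for the constraint \sum_i \| \phi_i \|^2 = const. All names and the toy data are assumptions.

```python
import numpy as np

def glvq_step(x, label, M, labels, A, mu, eps=0.1, xi=1.0):
    """One stochastic GLVQ step updating both the reference vectors M
    (one per row) and the q x d matrix A, per Eqs. (14), (16), (17), (20)."""
    y = A @ (x - mu)                                  # Eq. (11)
    d = ((M - y) ** 2).sum(axis=1)                    # squared distances
    same = labels == label
    ki = np.flatnonzero(same)[d[same].argmin()]       # nearest correct vector
    lj = np.flatnonzero(~same)[d[~same].argmin()]     # nearest incorrect vector
    dk, dl = d[ki], d[lj]
    rho = (dk - dl) / (dk + dl)                       # Eq. (14)
    lp = xi * np.exp(-xi * rho) / (1 + np.exp(-xi * rho)) ** 2  # l'(rho)
    a_ki = 4 * eps * lp * dl / (dk + dl)              # robust denominator [8]
    a_lj = 4 * eps * lp * dk / (dk + dl)
    step_ki = a_ki * (y - M[ki])                      # attraction, Eq. (16)
    step_lj = a_lj * (y - M[lj])                      # repulsion, Eq. (17)
    n0 = np.linalg.norm(A)
    A += np.outer(-step_ki + step_lj, x - mu)         # Eq. (20)
    A *= n0 / np.linalg.norm(A)                       # keep the norm of A fixed
    M[ki] += step_ki
    M[lj] -= step_lj
    return rho

# toy check: a correctly classified sample yields rho < 0
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
M = np.array([[0.5, 0.0], [-0.5, 0.0]])
rho = glvq_step(np.array([1.0, 0.0, 0.0]), 0, M, np.array([0, 1]),
                A, mu=np.zeros(3))
print(rho)   # negative: the sample lies on the correct side
```

After the step, m_{ki} has moved toward y and m_{lj} away from it, mirroring the behavior described around Fig. 1.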
Fig. 1. Update of transformation matrix and reference vectors.

Table 1. Error rates for test samples.

Dimension q | Type A (%) | Type B (%) | Type C (%)
          2 |      53.76 |      53.10 |      12.06
          4 |      24.02 |      19.84 |       1.42
          8 |      10.82 |       3.78 |       0.68
         16 |       7.68 |       1.44 |       0.56
         32 |       7.42 |       0.90 |       0.54
         64 |       7.26 |       0.54 |       0.52
        128 |       7.26 |       0.52 |       0.50
        400 |       7.26 |       0.50 |       0.50

4 Handwritten Digit Recognition Experiments
The dimensionality reduction method based on GLVQ was applied to handwritten digit recognition to evaluate its performance. An isolated-character database assembled in our laboratory was used in the experiments. The sample size was 1000 per class, divided into 500 training samples and 500 test samples per class. 400-dimensional feature vectors were extracted from each image by the weighted direction code histogram method [14]. One reference vector was assigned to each class, and these reference vectors were used with a minimum-distance method to classify the transformed vectors. The following three combinations of dimensionality reduction and classification were evaluated:
– Type A: transformation matrix obtained by PCA, and mean vectors,
– Type B: transformation matrix obtained by PCA, and reference vectors trained by GLVQ, and
– Type C: transformation matrix and reference vectors trained simultaneously by GLVQ.
Type A was used to obtain the initial values for training in both Type B and Type C. The dimensionality was reduced from the original 400 down to 2, and training and evaluation were carried out for each setting. Table 1 shows the error rates for the test samples, and these results imply that
Fig. 2. Feature transformation into two-dimensional space: (a) PCA; (b) Type C. (Axes: 1st and 2nd principal components.)
– the training of reference vectors by GLVQ is effective in improving recognition accuracy even when the dimensionality has been reduced by PCA,
– the training of both the transformation matrix and the reference vectors further improves the recognition accuracy for every number of features, and
– the recognition accuracy declines as the dimensionality is reduced.

Figure 2 shows how the feature vectors were mapped to the lower-dimensional space. All training samples were mapped to the two-dimensional space, by PCA on the left and by Type C on the right. The horizontal and vertical axes denote the projections onto the 1st and 2nd principal components, respectively. In the figure on the right, we can find ten clusters, one corresponding to each numeral, and the decision border can be approximated by the Voronoi diagram. This implies that the transformation matrix was designed to be appropriate for minimum-distance classifiers. Note that the bases trained by GLVQ are no longer orthogonal to each other.
5 Conclusion
In this paper, an application of discriminative dimensionality reduction based on generalized learning vector quantization (GLVQ) to handwritten digit recognition was described. The experimental results showed that training both the feature transformation matrix and the reference vectors by GLVQ is superior to principal component analysis in terms of dimensionality reduction. This implies that the transformation matrix designed by this method was appropriate to the given classifier. However, the recognition accuracy declined with decreasing dimensionality in the experiments; this is a problem for further investigation. The discriminative training of nonlinear feature transformations also remains an interesting topic for future work.
References

1. K. Fukunaga, "Introduction to Statistical Pattern Recognition," 2nd ed., San Diego: Academic Press, 1990.
2. C. Jutten and J. Herault, "Blind Separation of Sources," Signal Processing, Vol. 24, pp. 1-10, 1991.
3. A. Hyvärinen and E. Oja, "A Fast Fixed-Point Algorithm for Independent Component Analysis," Neural Computation, Vol. 9, No. 7, pp. 1483-1492, 1997.
4. A. Biem, S. Katagiri and B.-H. Juang, "Pattern Recognition using Discriminative Feature Extraction," IEEE Trans. on Signal Processing, Vol. 45, No. 2, pp. 500-504, 1997.
5. B.-H. Juang and S. Katagiri, "Discriminative Learning for Minimum Error Classification," IEEE Trans. on Signal Processing, Vol. 40, No. 12, pp. 3043-3054, 1992.
6. A. Sato and K. Yamada, "Generalized Learning Vector Quantization," in Advances in Neural Information Processing Systems 8, pp. 423-429, MIT Press, 1996.
7. A. Sato and K. Yamada, "An Analysis of Convergence in Generalized LVQ," in Proc. of the Int. Conf. on Artificial Neural Networks, Vol. 1, pp. 171-176, 1998.
8. A. Sato, "An Analysis of Initial State Dependence in Generalized LVQ," in Proc. of the Int. Conf. on Artificial Neural Networks, Vol. 2, pp. 928-933, 1999.
9. C.-L. Liu and M. Nakagawa, "Prototype Learning Algorithms for Nearest Neighbor Classifier with Application to Handwritten Character Recognition," in Proc. of the Int. Conf. on Document Analysis and Recognition, pp. 378-381, 1999.
10. T. Fukumoto, T. Wakabayashi, F. Kimura and Y. Miyake, "Accuracy Improvement of Handwritten Character Recognition by GLVQ," in Proc. of the Int. Workshop on Frontiers in Handwriting Recognition, pp. 271-280, 2000.
11. M.-K. Tsay, K.-H. Shyu and P.-C. Chang, "Feature Transformation with Generalized Learning Vector Quantization for Hand-Written Chinese Character Recognition," IEICE Trans. on Information and Systems, Vol. E82-D, No. 3, pp. 687-692, 1999.
12. S. Amari, "A Theory of Adaptive Pattern Classifiers," IEEE Trans. Elec. Comput., Vol. EC-16, No. 3, pp. 299-307, 1967.
13. R.O. Duda and P.E. Hart, "Pattern Classification and Scene Analysis," John Wiley & Sons, 1973.
14. F. Kimura, T. Wakabayashi, S. Tsuruoka and Y. Miyake, "Improvement of Handwritten Japanese Character Recognition Using Weighted Direction Code Histogram," Pattern Recognition, Vol. 30, No. 8, pp. 1329-1337, 1997.
A Computational Intelligence Approach to Optimization with Unknown Objective Functions

Hirotaka Nakayama (1), Masao Arakawa (2), and Rie Sasaki (1)

(1) Konan University, Dept. of Information Science and Systems Engineering, Kobe 658-8501, Japan, [email protected]
(2) Kagawa University, Dept. of Reliability-based Information Engineering, Kagawa 761-0396, Japan, [email protected]

Abstract. In many practical engineering design problems, the form of the objective function is not given explicitly in terms of the design variables. Under this circumstance, given values of the design variables, the value of the objective function is obtained by some analysis, such as structural, fluid-mechanic, or thermodynamic analysis. These analyses are usually considerably time consuming. In order to keep the number of analyses as small as possible, we suggest a method in which optimization is performed in parallel with predicting the form of the objective function. In this paper, radial basis function networks (RBFN) are employed to predict the form of the objective function, and genetic algorithms (GA) to search for the optimum of the predicted objective function. The effectiveness of the suggested method is shown through some numerical examples.
1 Introduction
Our aim in this paper is to optimize objective functions whose forms are not explicitly known in terms of design variables. In many engineering design problems of this kind, the values of the objective function are obtained by complicated, large-scale analyses with multi-peaked characteristics, as often seen in elastic-plastic analysis. These analyses usually require considerable computational time. Therefore, if such functions are optimized by existing methods, it takes an unrealistic amount of time to obtain a solution, and the number of necessary analyses should be kept as small as possible. To this end, we suggest a method consisting of two stages. The first stage is to predict the form of the objective function by RBFN (radial basis function networks). The second stage is to optimize the predicted objective function by GA (genetic algorithms). A major problem in this method is how to obtain a good approximation of the objective function on the basis of as few sample data as possible. To this end, the form of the objective function is revised by relearning on the basis of additional data, step by step. Our discussion in the following sections will focus on this additional learning and on how to select additional data.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 73–80, 2001.
© Springer-Verlag Berlin Heidelberg 2001
2 Additional Learning in RBFN
Since the number of sample points for predicting the objective function should be as small as possible, we adopt an additional learning technique which predicts the objective function by adding learning data step by step. RBFN are effective to this end. The output of an RBFN is given by

f(x) = \sum_{j=1}^{m} w_j h_j(x),

where h_j (j = 1, \ldots, m) are radial basis functions, e.g.,

h_j(x) = e^{-\| x - c_j \|^2 / r_j}.

Given the training data (x_i, \hat{y}_i) (i = 1, \cdots, p), the learning of an RBFN is usually performed by solving

E = \sum_{i=1}^{p} (\hat{y}_i - f(x_i))^2 + \sum_{j=1}^{m} \lambda_j w_j^2 \to \mathrm{Min},

where the second term is introduced for the purpose of regularization. Letting A = (H_p^T H_p + \Lambda), we have as a necessary condition for the above minimization

A \hat{w} = H_p^T \hat{y}.

Here
H_p^T = [h_1 \; \cdots \; h_p],

where h_j^T = [h_1(x_j), \ldots, h_m(x_j)], and \Lambda is a diagonal matrix with diagonal components \lambda_1, \cdots, \lambda_m. Therefore, the learning in an RBFN reduces to finding

A^{-1} = (H_p^T H_p + \Lambda)^{-1}.

Additional learning in an RBFN can be performed by adding new data points and/or basis functions as necessary. Since the learning is equivalent to the matrix inversion A^{-1}, additional learning reduces to the incremental calculation of this matrix inverse. The following algorithm can be found in [7].

(i) Adding a New Training Data Point. When a new data point x_{p+1} is added, the additional learning can be performed by a simple update formula. Let

H_{p+1} = \begin{pmatrix} H_p \\ h_{p+1}^T \end{pmatrix},

where h_{p+1}^T = [h_1(x_{p+1}), \ldots, h_m(x_{p+1})].
Then

A_{p+1}^{-1} = A_p^{-1} - \frac{A_p^{-1} h_{p+1} h_{p+1}^T A_p^{-1}}{1 + h_{p+1}^T A_p^{-1} h_{p+1}}.
(ii) Adding a New Basis Function. When a new basis function is needed to improve the learning for a new data point, the update formula for the matrix inverse is as follows. Let

H_{m+1} = ( H_m \;\; h_{m+1} ),

where h_{m+1}^T = [h_{m+1}(x_1), \ldots, h_{m+1}(x_p)]. Then

A_{m+1}^{-1} = \begin{pmatrix} A_m^{-1} & 0 \\ 0^T & 0 \end{pmatrix} + \frac{1}{\lambda_{m+1} + h_{m+1}^T (I_p - H_m A_m^{-1} H_m^T) h_{m+1}} \begin{pmatrix} A_m^{-1} H_m^T h_{m+1} \\ -1 \end{pmatrix} \begin{pmatrix} A_m^{-1} H_m^T h_{m+1} \\ -1 \end{pmatrix}^T.

Remark. It is important to choose the parameters c_j and r_j of the radial basis functions appropriately. We assign a basis function to each learning data point (x_i, \hat{y}_i) (i = 1, \cdots, p), so that c_i = x_i (i = 1, \cdots, p). Modifying the formula given by [2] slightly, the value of r_j is determined by

r = d_{\max} / \sqrt[n]{n m},

where d_{\max} is the maximal distance among the data, n is the dimension of the data, and m is the number of basis functions.
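Both the batch solution of A ŵ = H_p^T ŷ and the data-addition update (i) can be checked numerically. The sketch below is illustrative (not code from [7]); it assumes a common width r for all bases and arbitrary synthetic data, and verifies that the rank-one update of A^{-1} matches recomputing the inverse from scratch.

```python
import numpy as np

def rbf_design(X, C, r):
    """Design matrix with entries h_j(x_i) = exp(-||x_i - c_j||^2 / r)."""
    D2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-D2 / r)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(12, 2))       # p = 12 samples in 2-D
C, r, lam = X[:6], 0.5, 0.01               # m = 6 bases, common width, ridge
H = rbf_design(X, C, r)

# batch learning: solve A w = H^T y with A = H^T H + Lambda
y = np.sin(X[:, 0]) + X[:, 1] ** 2
A_inv = np.linalg.inv(H.T @ H + lam * np.eye(6))
w = A_inv @ H.T @ y

# (i) adding a new training point: rank-one update of A^{-1}
x_new = np.array([[0.3, -0.2]])
h = rbf_design(x_new, C, r)[0]             # h_{p+1}
A_inv_upd = A_inv - np.outer(A_inv @ h, h @ A_inv) / (1.0 + h @ A_inv @ h)

# the update must match recomputing the inverse from scratch
H_full = np.vstack([H, h])
A_inv_direct = np.linalg.inv(H_full.T @ H_full + lam * np.eye(6))
print(np.allclose(A_inv_upd, A_inv_direct))   # True
```

The agreement holds because adding a row h_{p+1}^T to H_p changes A by the rank-one term h_{p+1} h_{p+1}^T, which is exactly the situation the Sherman-Morrison-type update above handles.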
3 How to Select Additional Data

If the current solution is not satisfactory, namely if our stopping condition is not satisfied, we add some data in order to improve the approximation of the objective function. How to select such additional data then becomes the issue. If the current optimal point is taken as the additional data, the estimated optimal point tends to converge to a local maximum (or minimum) point, owing to the lack of global information in predicting the objective function. On the other hand, if additional data are taken far from the existing data, it is difficult to obtain more detailed information near the optimal point, and hence hard to obtain a solution with high precision, because of insufficient information near the optimal point. We suggest a method which provides both global information for predicting the objective function and local information near the optimal point at the same time. To this end, we take two kinds of additional data for relearning the form of the objective function. One of them is selected from a neighborhood of the current optimal point in order to add local information near the (estimated) optimal point. The size of this neighborhood is controlled during the convergence process.
The other is selected far from the current optimal point in order to give a better prediction of the form of the objective function. The former additional data point gives more detailed information near the current optimal point; the latter prevents convergence to a local maximum (or minimum) point. In this paper, the neighborhood of the current optimal point is given by a square S, whose center is the current optimal point, with side length l. Let S_0 be a square, whose center is the current optimal point, with a fixed side length l_0. The square S is shrunk according to the number C_x of optimal points that have appeared consecutively in S_0 in the past, namely

l = l_0 \times \frac{1}{C_x + 1}.   (1)

The first additional data point is selected inside the square S at random. The second additional data point is selected outside the square S, in a region in which the existing learning data are sparse. Such a region may be found as follows. First, a certain number (N_rand) of points are generated randomly outside the square S. Denote by d_ij the distance between the random point p_i (i = 1, \ldots, N_rand) and the existing learning data point q_j (j = 1, \ldots, N). Select the k shortest distances \tilde{d}_{ij} (j = 1, \ldots, k) for each p_i, and sum them up, i.e., D_i = \sum_{j=1}^{k} \tilde{d}_{ij}. Take the p_t which maximizes \{D_i\}_{i=1,\ldots,N_rand} as the additional data point outside S. The algorithm is summarized as follows:

Step 1: Predict the form of the objective function by RBFN on the basis of the given training data.
Step 2: Estimate an optimal point of the predicted objective function by GA.
Step 3: Count the number of optimal points that have appeared consecutively in S_0 in the past. This number is denoted by C_x.
Step 4: Terminate the iteration (i) if C_x is larger than or equal to a given C_x^0, or (ii) if the best value of the objective function obtained so far has remained identical during the last C_f^0 iterations. Otherwise calculate l by (1), and go to the next step.
Step 5: Select an additional data point near the current optimal point, i.e., inside S.
Step 6: Select another additional data point outside S, in a region in which the density of training data is low, as stated above.
Step 7: Go to Step 1.
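The sparse-region selection of Step 6 can be sketched directly. The snippet below is an illustrative implementation (not from the paper); the function name, the toy point sets, and the choice k = 2 are assumptions.

```python
import numpy as np

def pick_sparse_point(existing, candidates, k=3):
    """Return the candidate p_i maximizing D_i = sum of the k shortest
    distances to the existing learning data (Step 6 of the algorithm)."""
    d = np.linalg.norm(candidates[:, None, :] - existing[None, :, :], axis=-1)
    D = np.sort(d, axis=1)[:, :k].sum(axis=1)   # D_i = sum_{j=1}^{k} d~_ij
    return candidates[D.argmax()]

# three candidates: one near a dense cluster, one in an empty corner,
# and one in the middle; the empty-corner point should win
existing = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0]])
candidates = np.array([[0.05, 0.05], [0.9, 0.1], [0.5, 0.5]])
print(pick_sparse_point(existing, candidates, k=2))   # [0.9 0.1]
```

In the full algorithm the candidates would be the N_rand random points generated outside the square S.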
4 Numerical Example
As a practical problem, consider the pressure vessel design problem given by Sandgren (1990), Kannan and Kramer (1994), Qian et al. (1993), Hsu, Sun et al. (1995), Lin, Zhang et al. (1995), and Arakawa et al. (1997). A cylindrical pressure vessel is capped at both ends by hemispherical heads. The shell is to be made in two halves of rolled plate, which are joined by two longitudinal welds to form a cylinder. Each head is forged and then welded to the shell. All welds are single-welded butt joints with a backing strip. The material used in the vessel is carbon steel ASME SA203 grade B. The vessel will be oriented such that the axis of the cylindrical shell is vertical. The pressure vessel is to be a compressed-air storage tank with a working pressure of 2.068 × 10^7 Pa and a minimum volume of 1.229 × 10^{-2} m^3. The objective is to minimize the total cost of manufacturing the pressure vessel, including the cost of material and the cost of forming and welding. The design variables are R and L, the inner radius and the length of the cylindrical section, respectively, together with T_s and T_h, which are integer multiples of 0.0015875 m, the available thickness of rolled steel plates. The constraints g_1 and g_2 correspond to ASME limits on the geometry, while g_3 corresponds to a minimum volume limit. The problem can be summarized as follows:

Design variables
  R : inner radius (continuous)
  L : length of the cylindrical section (continuous)
  T_s : thickness of rolled steel plate (shell, discrete)
  T_h : thickness of rolled steel plate (head, discrete)

Constraints
  g_1 (minimum shell wall thickness): 0.00049022 R / T_s <= 1.0
  g_2 (minimum head wall thickness): 0.000242316 R / T_h <= 1.0
  g_3 (minimum volume of tank): (-(4/3) π R^3 + 21.24) / (π R^2 L) <= 1.0

Side constraints
  R_LB <= R <= R_UB,  L_LB <= L <= L_UB,  Ts_LB <= T_s <= Ts_UB,  Th_LB <= T_h <= Th_UB

Objective
  Minimize f = 37982.2 T_s R L + 108506.3 T_h R^2 + 193207.3 T_s^2 L + 1210711 T_s^2 R
Table 1. Numerical result (C_x^0 = 60, C_f^0 = 30).

Trial   | # of data | R (m)  | L (m)  | Ts (cm) | Th (cm) | f ($)
1       |        83 | 1.1515 | 3.5643 | 2.2225  | 1.1112  | 6092.1030
2       |       143 | 0.9855 | 5.6505 | 1.9050  | 0.9525  | 5862.2705
3       |       153 | 1.1499 | 3.5803 | 2.2225  | 1.1113  | 6099.2701
4       |       143 | 0.9700 | 5.9066 | 1.9050  | 0.9525  | 5958.4034
5       |       139 | 1.1499 | 3.5803 | 2.2225  | 1.1113  | 6099.2701
6       |       139 | 1.0677 | 4.5620 | 2.0638  | 1.1113  | 6118.7685
7       |       147 | 1.1499 | 3.5803 | 2.2225  | 1.1113  | 6099.2701
8       |       131 | 1.0677 | 4.5620 | 2.0638  | 1.1113  | 6118.7685
9       |       165 | 1.0693 | 4.4873 | 2.0638  | 1.1113  | 6060.3558
10      |       133 | 1.1515 | 3.5643 | 2.2225  | 1.1113  | 6092.1030
AVERAGE |     137.6 |        |        |         |         | 6060.0580
σ       |      21.6 |        |        |         |         |   83.6750
treating it in such a way that the objective function is not known. The parameters of the RBF network are given by λ = 0.01, and the width r is decided by the formula stated in the Remark above. For optimizing the predicted objective function, genetic algorithms are applied, owing to the simplicity of their algorithms and their ability to find an approximate global optimum. Although several sophisticated methods have been developed (e.g., Arakawa et al. [1]), a simple GA is adopted in this example, because our main concern in this paper is rather how well the method can approximate the objective function on the basis of as few data as possible. In our GA, continuous variables are encoded as binary strings of 11 bits and discrete variables as binary strings of 4 bits; the population size is 80; the number of generations is 100; the mutation rate is 10%. The constraint functions are penalized by

F = f(x) + \sum_{j=1}^{3} p_j \times [P(g_j(x))]^a,

where we set p_j = 100 (penalty coefficient), a = 2 (penalty exponent), and P(y) = \max\{y, 0\}. Our problem is to minimize the augmented objective function F. The data for initial learning are 17 points: the corners of the region given by the side constraints and its center. Finally, each side of the rectangle controlling the convergence and the neighborhood of the point under consideration is half of (the upper bound − the lower bound) of each side constraint. The results of numerical simulation for one choice of stopping conditions are given in Table 1. Fig. 1 shows the convergence process in the second trial of Table 1; the black line shows the best value obtained so far, while the gray line shows the optimal value of the predicted objective function. Table 2 shows the comparison among existing methods. In this table, the best solution obtained by each method is listed (Sandgren 1990, Kannan and Kramer 1994, Qian et al. 1993, Hsu, Sun et al.
Fig. 1. Convergence process. (Axes: value of objective function vs. number of training data.)

Table 2. Comparison among existing methods.

Method          | R (m) | L (m) | Ts (cm) | Th (cm) | f ($)
Sandgren        | 1.212 | 2.990 | 2.858   | 1.588   | 8129.80
Qian (GAs)      | 1.481 | 1.132 | 2.858   | 1.588   | 7238.83
Kannan          | 1.481 | 1.198 | 2.858   | 1.588   | 7198.20
Lin             | N/A   | N/A   | N/A     | N/A     | 7197.70
Hsu             | 1.316 | 2.587 | 2.540   | 1.270   | 7021.67
Lewis           | 0.985 | 5.672 | 1.905   | 0.953   | 5980.95
Arakawa         | 0.987 | 5.626 | 1.905   | 0.953   | 5851.78
Suggested meth. | 0.986 | 5.651 | 1.905   | 0.953   | 5862.27
1995, Lin, Zhang et al. 1995, Arakawa et al. 1997). Sandgren (1990) and Kannan et al. (1994) use gradient-type optimization techniques which treat constraints by penalty function or augmented Lagrangian methods. Lin et al. (1995) use a simulated annealing method and a genetic algorithm, while Qian et al. (1993) and Arakawa et al. (1997) use genetic algorithms. It can be seen that the suggested method gives a reasonable solution with far fewer analyses than the other methods (almost one twentieth of that required by genetic algorithms, e.g., Arakawa 1997).
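The penalty treatment used in the experiments (p_j = 100, a = 2, P(y) = max{y, 0}) can be sketched as follows. This is an illustrative snippet, not the authors' code; the toy objective and constraint are assumptions chosen only to show feasible and infeasible cases.

```python
def penalized_objective(x, f, constraints, p=100.0, a=2):
    """Augmented objective F = f(x) + sum_j p_j * max(g_j(x), 0)^a,
    with p_j = 100 and a = 2 as in the experiments."""
    return f(x) + sum(p * max(g(x), 0.0) ** a for g in constraints)

f = lambda x: x ** 2          # toy objective
g = lambda x: x - 1.0         # toy constraint g(x) <= 0
print(penalized_objective(0.5, f, [g]))   # 0.25: feasible, no penalty
print(penalized_objective(2.0, f, [g]))   # 104.0: 4.0 + 100 * 1.0^2
```

Since the penalty is zero on the feasible region and grows quadratically outside it, the GA can rank infeasible individuals without rejecting them outright.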
5 Concluding Remarks
We have suggested a method for optimizing objective functions which are not given explicitly in terms of design variables. To this end, the response surface method (RSM) is well known (Myers and Montgomery 1995). However, RSM uses polynomial functions (usually quadratic functions, for simplicity of implementation) for predicting objective functions, which causes difficulty in capturing global information, in particular in highly nonlinear cases. The suggested method has an advantage in that it adds global information and local information simultaneously. Comparative studies among these methods will be the subject of further research.
References

1. Arakawa, M. and Hagiwara, I. (1997), Nonlinear Integer, Discrete and Continuous Optimization Using Adaptive Range Genetic Algorithms, Proc. of ASME Design Technical Conferences (on CD-ROM)
2. Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company
3. Hsu, Y.H., Sun, T.L. and Leu, L.H. (1995), A Two-stage Sequential Approximation Method for Nonlinear Discrete Variable Optimization, ASME Design Engineering Technical Conference, Boston, MA, 197-202
4. Kannan, B.K. and Kramer, S.N. (1994), An Augmented Lagrange Multiplier Based Method for Mixed Integer Discrete Continuous Optimization and Its Applications to Mechanical Design, Transactions of ASME, J. of Mechanical Design, Vol. 116, 405-411
5. Myers, R.H. and Montgomery, D.C. (1995), Response Surface Methodology: Process and Product Optimization using Designed Experiments, Wiley
6. Nakayama, H., Yanagiuchi, S., Furukawa, K., Araki, Y., Suzuki, S. and Nakata, M. (1998), Additional Learning and Forgetting by RBF Networks and its Application to Design of Support Structures in Tunnel Construction, Proc. International ICSC/IFAC Symposium on Neural Computation (NC'98), 544-550
7. Orr, M.J.L. (1996), Introduction to Radial Basis Function Networks, http://www.cns.ed.ac.uk/people/mark.html
8. Qian, Z., Yu, J. and Zhou, J. (1993), A Genetic Algorithm for Solving Mixed Discrete Optimization Problems, DE-Vol. 651, Advances in Design Automation, Vol. 1, 499-503
9. Sandgren, E. (1990), Nonlinear Integer and Discrete Programming in Mechanical Engineering Systems, J. of Mechanical Design, Vol. 112, 223-229
10. Zhang, C. and Wang, H.P. (1993), Mixed-Discrete Nonlinear Optimization with Simulated Annealing, Engineering Optimization, Vol. 21, 277-291
Clustering Gene Expression Data by Mutual Information with Gene Function

Samuel Kaski, Janne Sinkkonen, and Janne Nikkilä

Neural Networks Research Centre, Helsinki University of Technology,
P.O. Box 5400, FIN-02015 HUT, Finland
{samuel.kaski,janne.sinkkonen,janne.nikkila}@hut.fi
Abstract. We introduce a simple on-line algorithm for clustering paired samples of continuous and discrete data. The clusters are defined in the continuous data space and become local there, while within-cluster differences between the associated, implicitly estimated conditional distributions of the discrete variable are minimized. The discrete variable can be seen as an indicator of relevance or importance guiding the clustering. Minimization of the Kullback-Leibler divergence-based distortion criterion is equivalent to maximization of the mutual information between the generated clusters and the discrete variable. We apply the method to a time-series data set, i.e., yeast gene expressions measured with DNA chips, with biological knowledge about the functions of the genes encoded into the discrete variable.
1 Introduction
Clustering methods depend crucially on the criterion of similarity. In this work we present a new algorithm which uses additional information to decide the relevance of a dissimilarity in the data. This information is available as auxiliary samples c_k, which form pairs (x_k, c_k) with the primary samples x_k. In this study the extra information is discrete and consists of labels of functional classes of genes. It is assumed that differences in the auxiliary data indicate what is important in the primary data space. Intuitively speaking, the auxiliary data guides the clustering to emphasize the (locally) important dimensions of the primary data space and to disregard the rest. This automatic relevance detection is the main practical motivation for the present work. The distance between two close-by points x and x' is measured by the differences between the distributions of c given x and given x'. Our aim is to minimize the within-cluster dissimilarities in terms of the (implicitly estimated) distributions p(c|x), while still defining the clusters by localized basis functions within the primary space. Put differently, we supervise the metric of the data space to concentrate on the important or relevant aspects of the data. The main goal is data-driven unsupervised exploration of the data in the new metric to make

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 81–87, 2001.
© Springer-Verlag Berlin Heidelberg 2001
Samuel Kaski, Janne Sinkkonen, and Janne Nikkilä
new discoveries of its statistical properties. Here the changes in the distribution of the auxiliary data provide the supervised metric, and the unsupervised task is clustering. The method could perhaps be called semi-supervised, or clustering "guided" by the supervisory signal. In an earlier work [8] we constructed explicit estimates of the conditional distributions p(c|x) and used them to compute the local distances. In the present work we use an alternative method [11] which does not need an estimate of the conditional distributions p(c|x) as an intermediate step. Minimizing the within-cluster distortion is equivalent to maximizing the mutual information between the clusters (where cluster membership is interpreted as a value of a multinomial random variable) and the auxiliary data. Maximization of mutual information has previously been used for constructing neural representations [1,2]. Other related works and paradigms include learning from (discrete) dyadic data [12], and distributional clustering [10] with the information bottleneck [13] principle. These methods start from an estimate of the joint distribution p(c, x), which must be available for all discrete x. Our method places and defines clusters directly in a continuous input space; new data x can then be assigned to clusters even if the auxiliary data is not available. In this work the method is applied to gene expression data, which has previously been clustered with conventional clustering algorithms (see, e.g., [5]) and classified with supervised methods (Support Vector Machines, [3]). Our goal is to use the known functional classification of the genes to implicitly define which aspects of the expression data are important, and to utilize only those important aspects, local factors or dimensions, in the clustering.
The difference from ordinary supervised classification methods is that while they cannot surpass the original classes, our method is not tied to the classification and may reveal substructures within, and relations between, the known functional classes. In this case study we compare our method empirically with alternative methods and demonstrate the potential usefulness of the results. More detailed biological interpretation of the results will be presented in subsequent papers.
2 Clustering Based on the Kullback-Leibler Divergence
We seek to categorize items x of the data space by utilizing the information within a set of samples (x_k, c_k) from a random variable pair (X, C). Here x ∈ X ⊂ R^n, and the c_k are discrete, multinomial values. We wish to define the clusters in terms of x only, and to keep the clusters local with respect to x. The distortions minimized by the clustering are, however, measured between distributions in the auxiliary space, although it is not necessary to compute these distributions explicitly. Vector quantization (VQ) or, equivalently, K-means clustering, is one approach to categorization. In ("soft") VQ the goal is to minimize the average distortion E between the data and the prototypes or codebook vectors m_j, defined by
Clustering Gene Expression Data
E = Σ_j ∫ y_j(x) D(x, m_j) p(x) dx .   (1)
Here D(x, m_j) is the distortion between x and m_j, and y_j(x) is the "soft" cluster membership function that fulfills 0 ≤ y_j(x) ≤ 1 and Σ_j y_j(x) = 1. We measure distortions by the Kullback-Leibler divergence D_KL(p, ψ) ≡ Σ_i p_i log(p_i/ψ_i), where p_i is the multinomial distribution in the auxiliary space that corresponds to the data x, that is, p_i ≡ p(c_i|x). The second distribution is a prototype for the cluster; let us denote the jth prototype by ψ_j. Although the distortions are measured between distributions of the auxiliary variable, the cluster memberships are still defined in the primary data space, by the membership functions y_j(x). In our application the data lie on a hypersphere, and hence we use membership functions that are defined for spheres. The membership functions are normalized von Mises-Fisher (vMF(x; θ_j)) kernels [9], spherical analogues of Gaussians, parameterized by their centers θ_j (and widths κ): y_j(x; θ_j) = vMF(x; θ_j) / Σ_k vMF(x; θ_k), where vMF(x; θ_j) = const. × exp(κ x^T θ_j / ‖θ_j‖). The final cost function to be optimized is (1), with y_j(x) replaced by the parameterized y_j(x; θ_j), and D(x, m_j) replaced by the divergence D_KL(p(c|x), ψ_j).

Connection to mutual information. Denote by V the random variable that indicates which cluster a data sample belongs to; that is, the value of V is v_i if the sample belongs to cluster i. It can be shown that our cost function is equal to the (negative) mutual information between V and C, −I(C; V), up to an additive constant.

The algorithm. It can be shown [7] that the average distortion can be minimized by stochastic approximation that samples from y_j(x) y_l(x) p(c_i, x) = p(v_j, v_l, c_i, x). This leads to an online algorithm in which the following steps are repeated for t = 0, 1, ... with α(t) gradually decreasing towards zero:

1. Draw a data sample (x(t), c(t)). The discrete sample c(t) = c_i defines the value of i in the following steps.
2. Draw two basis functions, j and l, from a multinomial distribution with probabilities {y_k(x(t))}_k.
3. Adapt the parameters θ_l and γ_lm, m = 1, ..., N_c, by

   θ_l(t+1) = θ_l(t) − α(t) (x(t) − x(t)^T θ_l(t) θ_l(t)) log [ψ_ji(t) / ψ_li(t)]   (2)

   γ_lm(t+1) = γ_lm(t) − α(t) (ψ_lm(t) − δ_mi) ,   (3)

where N_c is the number of possible values of the random variable C, δ_mi is the Kronecker delta, and the γ_l are prototype distributions reparameterized to keep their sum at unity: log ψ_ji = γ_ji − log Σ_m exp(γ_jm). Due to the symmetry between j and l, it is possible to adapt the parameters twice for each t by swapping j and l in (2) and (3) for the second adaptation step. Note that θ_l(t+1) = θ_l(t) if j = l.
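The three steps above can be sketched in code. The following is a minimal, illustrative implementation (not the authors' original program): the toy data generator, the constant kernel width κ, the learning-rate schedule, and the re-normalization of θ_l onto the unit sphere after each update are our own assumptions, and the symmetric second adaptation step with j and l swapped is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def vmf_memberships(x, thetas, kappa=5.0):
    """Normalized von Mises-Fisher kernel values y_j(x) for unit-norm x."""
    units = thetas / np.linalg.norm(thetas, axis=1, keepdims=True)
    logits = kappa * units @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

def softmax(g):
    """Prototype distributions psi_j from the reparameterized gamma_j."""
    e = np.exp(g - g.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def online_step(x, i, thetas, gammas, alpha, kappa=5.0):
    """One pass of steps 1-3: draw clusters j, l; adapt theta_l and gamma_l."""
    y = vmf_memberships(x, thetas, kappa)
    j, l = rng.choice(len(thetas), size=2, p=y)   # step 2
    psi = softmax(gammas)
    log_ratio = np.log(psi[j, i] / psi[l, i])
    # eq. (2): move theta_l along the tangent direction of the sphere
    grad = x - (x @ thetas[l]) * thetas[l]
    thetas[l] -= alpha * grad * log_ratio
    thetas[l] /= np.linalg.norm(thetas[l])        # our assumption: keep centers unit-norm
    # eq. (3): adapt the reparameterized prototype distribution
    delta = np.zeros(gammas.shape[1])
    delta[i] = 1.0
    gammas[l] -= alpha * (psi[l] - delta)

# toy run: 3 clusters, 2 classes, data on the unit circle
thetas = rng.normal(size=(3, 2))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
gammas = np.zeros((3, 2))
for t in range(2000):
    x = rng.normal(size=2)
    x /= np.linalg.norm(x)
    i = int(x[0] > 0)                             # class depends on location
    online_step(x, i, thetas, gammas, alpha=0.05 * (1 - t / 2000))
```

Note that, as in step 2 of the text, j and l are drawn independently with probabilities y_k(x), so j = l is possible and then θ_l is unchanged because the log-ratio vanishes.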
3 Case Study: Clustering of Gene Expression Data
We clustered 2,467 genes of the budding yeast Saccharomyces cerevisiae by their activity or expression in eight experimental conditions (see [5]), measured as time series with DNA chips. The total of 79 time points were collected into the feature vector. The data was preprocessed in the same way as in [3] and divided into a training set (two thirds) and a test set (the remaining third).

3.1 Alternative Models

We compared our model with two standard state-of-the-art mixture density models. The goal of these models is slightly different from ours, but we use them to provide baseline results; to our knowledge they are the most closely related existing models. The first is a totally unsupervised mixture of von Mises-Fisher distributions. The model is analogous to the usual mixture of Gaussians; the Gaussian mixture components are simply replaced by von Mises-Fisher components. In the second model, MDA2 [6], the joint distribution between the functional classes c and the expression data x is modeled by a set of additive components denoted by u_j:

   p(c_i, x) = Σ_j p(c_i|u_j) p(x|u_j) p_j ,   (4)
where p(c_i|u_j) and p_j are parameters to be estimated, and p(x|u_j) = vMF(x; θ_j). Both models are fitted to the data by likelihood maximization with the EM algorithm [4].

3.2 The Experiments

We compared the three models, each with 8 clusters and optimized to near convergence. All models were run three times with different random initializations, and the best of the three results was chosen. The quality of the resulting clusterings was measured by the average distortion error or, equivalently, the empirical mutual information. The results in Figure 1 (A) show that our model clearly outperforms the others for a wide range of kernel widths and produces the best overall performance. The clusters of our model therefore convey more information about the functional classification of the genes than those of the alternative models. There is a somewhat surprising side note in the results: the vMF mixture model is slightly better than MDA2, although it does not utilize the class information at all. The reason probably lies in some special property of the data, since the likelihood of the paired data (not shown), the measure that MDA2 optimizes, was larger for MDA2 than for the simple mixture model in which the class distribution in the clusters was estimated afterwards. Moreover, for other data sets MDA2 has outperformed the plain mixture model. The distribution of some functional subclasses of genes in the clusters is shown in Figure 1 (B). Note that these subclasses were not included in the
[Figure 1 appears here: panel (A) is a plot of empirical mutual information in bits; panel (B) is a table of gene counts per cluster for functional subclasses a-g.]
Fig. 1. (A) Empirical mutual information between the generated gene expression clusters and the functional classes of the genes, as a function of the parameter κ which governs the width of the basis functions. Solid line: our model; dotted line: mixture of vMFs; dashed line: MDA2. (B) Distribution of genes (learning and test sets combined) of sample functional subclasses into the 8 clusters. These subclasses were not used in supervising the clustering. a: the pentose-phosphate pathway, b: the tricarboxylic-acid pathway, c: respiration, d: fermentation, e: ribosomal proteins, f: cytoplasmic degradation, g: organization of chromosome structure.
values of the auxiliary variable C. Instead, they were picked from the second and third levels of the functional classification, and they include groups that are known to be regulated in the experimental settings measured by the data [5]. Most of the subclasses are concentrated in one of the clusters. The first four (a-d) belong to the same first-level class and are placed (mostly) in the same cluster, number 3. Three of the subclasses (c, e, and f) have been clearly divided into more than one cluster, indicating some potentially biologically interesting subdivision of genes inside the subclasses. It remains to be seen in further biological inspection whether such subdivisions provide new biological information; in this paper our goal is to demonstrate that the "semi-supervised" clustering approach can be used to explore the data set and provide potential further hypotheses about its structure.
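The empirical mutual information used above as the quality measure can be computed from a contingency table of hard cluster assignments versus class labels. The sketch below is our own illustrative helper, not the authors' code:

```python
import numpy as np

def empirical_mutual_information(clusters, classes):
    """I(V; C) in bits, estimated from paired discrete samples."""
    clusters = np.asarray(clusters)
    classes = np.asarray(classes)
    n = len(clusters)
    joint = np.zeros((clusters.max() + 1, classes.max() + 1))
    for v, c in zip(clusters, classes):
        joint[v, c] += 1.0
    joint /= n                                   # empirical joint p(v, c)
    pv = joint.sum(axis=1, keepdims=True)        # marginal over clusters
    pc = joint.sum(axis=0, keepdims=True)        # marginal over classes
    nz = joint > 0                               # skip zero cells of the table
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pv @ pc)[nz])))

# perfectly aligned clusters carry 1 bit about a binary class
assert abs(empirical_mutual_information([0, 0, 1, 1], [0, 0, 1, 1]) - 1.0) < 1e-12
```

For cluster assignments independent of the classes the estimate is zero, and it grows toward the class entropy as the clusters become informative about gene function.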
4 Conclusions
We have presented a new clustering method for continuous data which minimizes the within-cluster distortion between distributions of associated, discrete auxiliary data [11]. A simple online algorithm was used for optimizing the distortion measure. We clustered a yeast gene expression data set in which auxiliary information was available as an independent, functional classification of the genes. In our semi-supervised task the algorithm performed better than other algorithms
available for continuous data: the mixture of vMFs and MDA2, which models the joint density of the expression data and the classes. Note that the number of extracted clusters can be arbitrary; it need not depend on the auxiliary data (here the number of classes). Depending on whether the number of clusters is larger or smaller than the number of classes, the relationships between the classes or the substructures within the classes can be studied. Such studies are not possible with supervised methods. It was shown that the obtained clusters convey information about the function of the genes and, although the results have not yet been biologically analyzed, potentially suggest novel cluster structures for the yeast genes.

Acknowledgments. This work was supported by the Academy of Finland, in part by grant 50061. We wish to thank Petri Törönen for his help with the data and Jaakko Peltonen for his help with some of the programs.
References

1. S. Becker. Mutual information maximization: models of cortical self-organization. Network: Computation in Neural Systems, 7:7-31, 1996.
2. S. Becker and G. E. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163, 1992.
3. M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, Jr., and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, USA, 97:262-267, 2000.
4. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
5. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, USA, 95:14863-14868, 1998.
6. T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant and mixture models. In J. Kay and D. Titterington, editors, Neural Networks and Statistics. Oxford University Press, 1995.
7. S. Kaski. Convergence of a stochastic semi-supervised clustering algorithm. Technical Report A62, Helsinki University of Technology, Publications in Computer and Information Science, Espoo, Finland, 2000.
8. S. Kaski, J. Sinkkonen, and J. Peltonen. Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks, 2001, in press.
9. K. V. Mardia. Statistics of directional data. Journal of the Royal Statistical Society, Series B, 37:349-393, 1975.
10. F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 183-190, 1993.
11. J. Sinkkonen and S. Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 2001, in press.
12. T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 466-472. Morgan Kaufmann Publishers, San Mateo, CA, 1998.
13. N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In 37th Annual Allerton Conference on Communication, Control, and Computing, Urbana, Illinois, 1999.
Learning to Learn Using Gradient Descent

Sepp Hochreiter¹, A. Steven Younger¹, and Peter R. Conwell²
¹ Department of Computer Science, University of Colorado, Boulder, CO 80309-0430
² Physics Department, Westminster College, Salt Lake City, Utah
Abstract. This paper introduces the application of gradient descent methods to meta-learning. The concept of "meta-learning", i.e. of a system that improves or discovers a learning algorithm, has been of interest in machine learning for decades because of its appealing applications. Previous meta-learning approaches have been based on evolutionary methods and, therefore, have been restricted to small models with few free parameters. We make meta-learning in large systems feasible by using recurrent neural networks with their attendant learning routines as meta-learning systems. Our system derived complex, well-performing learning algorithms from scratch. In this paper we also show that our approach performs non-stationary time series prediction.
1 Introduction
Phrases like "I have experience in ...", "This is similar to ...", or "This is a typical case of ..." imply that the person making such statements learns the task at hand faster or more accurately than an inexperienced human. This learning enhancement results from solution regularities in a problem domain. In a conventional machine learning approach the learning algorithm mostly does not take previous learning experiences into account, despite the fact that methods similar to human reasoning are expected to yield better performance. The use of previous learning experiences in inductive reasoning is known as "knowledge transfer" [4,1,14] or "inductive bias shift" [15,6,12]. Within the research field of "knowledge transfer" we focus on one of the most appealing topics: "meta-learning" or "learning to learn" [4,14,13,11]. A meta-learner searches out and finds appropriate learning algorithms tailored to specific learning tasks. To find such learning methods, a supervisory algorithm that reviews and modifies the training algorithm must be added. In contrast to the subordinate learning scheme, the supervisory routine has a broader scope. It must ignore the details unique to specific problems and look for symmetries over a long time scale, i.e. it must perform "knowledge transfer". For example, consider a human as the supervisor and a kernel density estimator as the subordinate method. The human has previous experience with overfitting and tries to avoid it by adding a bandwidth adaptation, thereby improving the estimator. We want to obtain such learning method improvements automatically by replacing
[Figure 1 appears here: a schematic of the meta-learning system, showing on the left the supervisory system (the fixed learning algorithm, BPTT or RTRL, of a recurrent neural network) and on the right the adjustable subordinate system (model and learning algorithm), with inputs x(j) and y(j-1) and target y(j).]
Fig. 1. The meta-learning system consists of the supervisory and the subordinate system (sequence element j is being processed). The subordinate system is a recurrent network. Its attendant learning algorithm represents the fixed supervisory system. Target function arguments x are mapped to results y, e.g. y(j) = f(x(j)). The previous function result y(j-1) is supplied to the subordinate system so that it can determine the previous error of the subordinate model. Subordinate and supervisory outputs are identified.
the human part with an appropriate system. This automatic system must include an objective function to judge the performance of the learning algorithm and rules for the adjustment of the algorithm. Meta-learning is known in the reinforcement learning framework [11,12]. This paper reports on our work on meta-learning in a supervised learning framework, where a model is supposed to approximate a function after being trained on examples. Our meta-learning system consists of the supervisory procedure, which is fixed, and of the adjustable subordinate system, which must be run on a certain medium (see left-hand side of Figure 1). To exemplify this, we might have used a Turing machine (i.e. a computer) as this medium, where the subordinate model and the subordinate training routine are represented by a program (see right-hand side of Figure 1). Any changes to the program amount to changes in the subordinate learning algorithm (in general, the coded model and the coded learning algorithm cannot be separated; accordingly, the term "learning algorithm" refers to both). However, the output of the discrete Turing machine is not differentiable. Thus, only deductive or evolutionary strategies can be used to improve the Turing machine program. Instead of executing the subordinate learning algorithm on a Turing machine, our method executes the algorithm with a recurrent neural network in order to get a differentiable output. This is possible because a (sufficiently large) recurrent neural network can emulate a Turing machine. The differentiable output allows us to apply gradient descent methods to improve the subordinate routine. A recurrent network with random initial weights can be viewed as a learning machine with a very poor subordinate learning algorithm. We hypothesize that gradient-based optimization approaches can be used to derive a learning algorithm from a random starting point. The capability of recurrent networks to execute the subordinate system was proved and demonstrated in [3,19]. Several researchers have suggested meta-learning systems based on neural networks and used genetic algorithms to adjust the subordinate learning algorithm [2,10,19]. Our goal is to obtain complex subordinate learning algorithms, which need a large recurrent network with many parameters. Genetic algorithms are infeasible here due to the large number of computations required. This paper introduces gradient descent for meta-learning to handle such large systems and, thus, to provide an optimization technique in the space of learning algorithms. Every recurrent neural network architecture with its attendant learning procedure is a possible meta-learning system. One may choose, for example, backpropagation through time (BPTT [18,16]) or real-time recurrent learning (RTRL [9,17]) as the attendant learning algorithm. The meta-learning characteristic of these networks is determined only by the special kind of input-target sequences described in Section 2.1. Both BPTT and RTRL applied to standard recurrent nets do not yield good meta-learning performance, as will be seen in Section 3. The reason for this poor performance is given in Section 2.2. In the same section, the use of the Long Short-Term Memory (LSTM [8]) architecture is suggested to achieve better results. Section 2.3 gives an intuition of how the "inductive bias shift" ("knowledge transfer") takes place during meta-learning. The experimental Section 3 demonstrates how different learning procedures for different problem domains are automatically derived by our meta-learning systems.
2 Theoretical Considerations

2.1 The Data Setup for Meta-learning with Recurrent Nets
This section describes the kind of input-target sequences that allow meta-learning in recurrent nets. The training data for the meta-learning system is a set of sequences {s_k}, where sequence s_k is obtained from a target function f_k. At each time step j during processing of the kth sequence, the meta-learning system needs the function result y_k(j) = f_k(x_k(j)) as a target. The input to the meta-learning system consists of the current function argument vector x_k(j) and a supplemental input, which is the previous function result y_k(j-1). The subordinate learning algorithm needs the previous function result y_k(j-1) so that it can learn the presented mapping, e.g. to compute the subordinate model error for input x_k(j-1). We cannot provide the current target y_k(j) as an input to the recurrent network, since we cannot prevent the model from cheating by hard-wiring the current target to its output. Figure 1 illustrates the inputs and targets for the different learning systems. The meta-learning system is penalized at each time point when it does not generate the correct target value, i.e. when the subordinate procedure was not yet able to learn the current function. This forces the meta-learning system to improve the subordinate algorithm so that it becomes faster and more exact. Figure 2 shows test sequences after successful meta-learning. New sequences start at 513, 770, and 1027, where the subordinate learning method produces large errors because the new function has not yet been learned. After a few examples the subordinate system has learned the new function.
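The input-target layout described above can be made concrete. The sketch below (our own illustrative construction, not the authors' code) builds one training sequence for a target function, with input (x(j), y(j-1)) and target y(j); the target family is the semilinear one used later in the experiments, 0.5(1 + tanh(w1 x1 + w2 x2 + w3)), and the uniform input range and the initialization of the supplemental input to zero are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_sequence(f, n_examples, dim=2):
    """Inputs are (x(j), y(j-1)); the target at step j is y(j) = f(x(j))."""
    xs = rng.uniform(-1.0, 1.0, size=(n_examples, dim))
    ys = np.array([f(x) for x in xs])
    prev_y = np.concatenate(([0.0], ys[:-1]))   # supplemental input, 0 at j = 1
    inputs = np.column_stack([xs, prev_y])
    return inputs, ys

def random_semilinear():
    """One target function from the semilinear family of the experiments."""
    w = rng.uniform(-1.0, 1.0, size=3)
    return lambda x: 0.5 * (1.0 + np.tanh(w[0] * x[0] + w[1] * x[1] + w[2]))

inputs, targets = make_sequence(random_semilinear(), n_examples=64)
# the current target y(j) never appears in the input row for step j
assert inputs.shape == (64, 3) and np.allclose(inputs[1:, 2], targets[:-1])
```

A full training set would concatenate such sequences, one per randomly drawn function, so that the network is penalized at every step until the subordinate algorithm has learned the current function.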
The characteristics of the derived subordinate algorithms can be influenced by the sequence length (more examples per function give more precise but slower algorithms), the error function, and the architecture.

2.2 Selecting a Recurrent Architecture for Meta-learning
For simplicity we consider one function f giving the sequence (x_1, y_1), ..., (x_J, y_J), where y_j = f(x_j). All training examples (x_j, y_j) contain equal information about f. The indices correspond to the time steps of the recurrent net. We want to bias our meta-learning system towards this prior knowledge. The information in the last output o_J (indicated by J) is determined by the entropy H(O_J | X_J). Here random variables are denoted by capital letters, e.g. X for the input, Y for the target, and O for the output. H(A) is the entropy of A, and conditional entropies are denoted by H(A | B). The last output is obtained as o_J = g(y_j; x_J, x_j) + ε, where g is a bijective function of the variable y_j with parameters x_J and x_j, and ε expresses disturbances during input sequence processing. We assume noisy mappings to avoid infinite entropies. Neglecting ε, we get

   H(O_J | X_J, X_j) = H(Y_j | X_j) + E_{Y_j, X_j, X_J} log |∂g(Y_j; X_j, X_J) / ∂Y_j| ,

where p(Y_j | x_j, x_J) = p(Y_j | x_j), |∂g(Y_j; X_j, X_J)/∂Y_j| is the absolute value of g's Jacobian determinant, and E_{A,B,...} is the expectation over the variables A, B, .... The hidden state at time j is s_j = u(s_{j-1}, x_j, y_{j-1}) and the output is o_j = v(s_j). With i < j < J we get

   ∂o_J/∂y_j = (∂o_J/∂s_{j+1}) (∂s_{j+1}/∂y_j) ,
   ∂o_J/∂y_i = (∂o_J/∂s_{j+1}) (∂s_{j+1}/∂s_{i+1}) (∂s_{i+1}/∂y_i) ,
   ∂s_{j+1}/∂s_{i+1} = ∏_{l=i+1}^{j} ∂s_{l+1}/∂s_l .

Our prior knowledge says that exchanging examples i and j should not affect the output information. That is, H(O_J | X_J, X_j) = H(O_J | X_J, X_i), and the derivative of the output should not change either. In this case Y_j = Y_i, X_j = X_i, p(Y_j | x_j) = p(Y_i | x_i) for x_j = x_i, and H(Y_j | X_j) = H(Y_i | X_i). At the beginning of learning, with arbitrary weight initialization, E_{Y_j, X_j} [∂S_{j+1}/∂Y_j] = E_{Y_i, X_i} [∂S_{i+1}/∂Y_i]. Thus, we obtain

   E_{Y_j, X_j, X_J} log |∂g(Y_j; X_j, X_J)/∂Y_j| − E_{Y_i, X_i, X_J} log |∂g(Y_i; X_i, X_J)/∂Y_i| = 0 ,

or

   E_{Y_i, X_i} Σ_{l=i+1}^{j} log |∂S_{l+1}/∂S_l| = 0 ,

e.g. with |∂s_{l+1}/∂s_l| = 1. Thus u, restricted to a mapping from s_l to s_{l+1}, should be volume conserving. An architecture which incorporates such a volume-conserving substructure should outperform other architectures. An architecture fulfilling this requirement is Long Short-Term Memory (LSTM [8]).
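The effect motivating this requirement can be illustrated numerically. The sketch below is our own toy example, not from the paper: for a contracting recurrent map the product of Jacobian factors ∂s_{l+1}/∂s_l shrinks toward zero over many time steps, while a volume-conserving (identity) state transition, like the constant error flow in an LSTM memory cell, keeps the product at exactly one.

```python
import numpy as np

def jacobian_product(derivs):
    """|prod_l ds_{l+1}/ds_l| for a scalar state over many time steps."""
    return float(np.abs(np.prod(derivs)))

steps = 100
w = 0.9                                 # recurrent weight of the toy map
s = 0.3                                 # initial scalar state
sigmoid_derivs = []
for _ in range(steps):
    s = np.tanh(w * s)                  # contracting recurrent map
    sigmoid_derivs.append(w * (1.0 - s ** 2))   # ds_{l+1}/ds_l = w (1 - tanh^2)

identity_derivs = [1.0] * steps         # volume-conserving carousel: ds_{l+1}/ds_l = 1

assert jacobian_product(sigmoid_derivs) < 1e-3   # gradient information vanishes
assert jacobian_product(identity_derivs) == 1.0  # information is preserved
```

This is the scalar special case of the condition |∂s_{l+1}/∂s_l| = 1 derived above; with |w| < 1 each factor is below one, so early examples lose their influence on the last output.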
[Figure 2 appears here: a plot of error (0 to 1) over time steps 400 to 1100.]
Fig. 2. Error vs. time after meta-learning.
2.3 Bayes View on Meta-learning
Meta-learning can be viewed as constantly adapting and shifting the hyperparameters and the prior ("inductive bias shift"), because the subordinate learning algorithm is adapted to the problem domain during meta-learning. As the experiments confirm, the prior of the subordinate learning algorithm is also data dependent. This was suggested in [7], too. Therefore, different previously observed examples may lead to different current learning.
3 Experiments
We chose a squared error function for the supervisory learning routine. All networks possess 3 input units and 1 non-recurrent output unit. All non-input units are biased and have sigmoid activation functions in [0, 1]. Weights are randomly initialized from [-0.1, 0.1]. All networks are reset after each sequence presentation.

3.1 Boolean Functions
Here we consider the set B16 of all Boolean functions with two arguments and one result. The set B14 = B16 \ {XOR, ¬XOR} of linearly separable Boolean functions is used to evaluate meta-learning architectures.

B14 Experiments. We compared the following methods: (A) Elman network [5]. (B) Recurrent network with a fully recurrent hidden layer, trained with backpropagation through time (BPTT [18,16]) truncated after 2 time steps, and with real-time recurrent learning (RTRL [9,17]). (C) Long Short-Term Memory (LSTM [8]) with its corresponding learning procedure. The cell input is squashed to [-2, 2] by a sigmoid and the cell output is a sigmoid in [-1, 1]. For input gates the bias is set to -1.0. Table 1 gives the results. Only LSTM was successful.
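The sets B16 and B14 can be enumerated directly. The small sketch below (our own illustration, not part of the paper) confirms that exactly two of the sixteen two-argument Boolean functions, XOR and its negation, are not realizable by a single threshold unit; the small search ranges for the weights and bias suffice for two binary inputs.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
B16 = [tuple(bits) for bits in product([0, 1], repeat=4)]   # all truth tables

def linearly_separable(table):
    """Is there a threshold unit w1*x1 + w2*x2 + b > 0 realizing the table?"""
    for w1, w2 in product(range(-2, 3), repeat=2):
        for b2 in range(-5, 6, 2):            # half-integer biases avoid ties
            b = b2 / 2.0
            if all(int(w1 * x1 + w2 * x2 + b > 0) == t
                   for (x1, x2), t in zip(INPUTS, table)):
                return True
    return False

B14 = [t for t in B16 if linearly_separable(t)]
XOR = (0, 1, 1, 0)
assert len(B14) == 14 and XOR not in B14
```

This matches the definition in the text: removing XOR and ¬XOR from the sixteen functions leaves exactly the fourteen linearly separable ones.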
Table 1. The B14 experiments for Elman nets ("Elman"), recurrent networks with a fully recurrent hidden layer ("Rec."), and LSTM. The columns show: (1) architecture ("arch."), (2) number of hidden units (for LSTM, "6/6(1)" means 6 hidden units and 6 memory cells of size 1), (3) learning method (for Elman nets and LSTM their own learning methods are used, and BPTT is truncated after 2 time steps), (4) batch ("b") or online ("o") update, (5) learning rate α (0.001-0.1 means different learning rates in this range), (6) training epochs, (7) training and (8) test mean squared error per time step ("MSEt"), and (9) successful training ("success").

arch.  | hid. units | learning method | update | α          | train time | train MSEt | test MSEt | success
Elman  | 15         | Elman           | b      | 0.001-0.1  | 5000       | -          | -         | NO
Rec.   | 20         | BPTT(2)         | b & o  | 0.001-0.01 | 40000      | -          | -         | NO
Rec.   | 10         | RTRL            | b & o  | 0.001-0.1  | 20000      | 0.22       | 0.21      | NO
LSTM   | 6/6(1)     | LSTM            | o      | 0.001      | 1000       | 0.033      | 0.038     | YES
B16 Experiments. The results are shown in Table 2. The mean squared errors per time step (MSEt) are lower than for B14 because the large error at the beginning of a new function scales down with more examples. See Figure 2 for the absolute error after meta-learning. The peaks at 513, 770, and 1027 indicate large errors when the function changes.

3.2 Semilinear Functions
We obtain functions 0.5 (1.0 + tanh(w1 x1 + w2 x2 + w3)) with input vector x = (x1, x2) by choosing each parameter w_l randomly from [-1, 1]. Table 2 presents the results. With more examples per function, the pressure to reduce the error on the first examples is lower, which leads to slower but more exact learning.

3.3 Quadratic Functions
The problem domain consists of the quadratic functions a x1^2 + b x2^2 + c x1 x2 + d x1 + e x2 + f scaled to the interval [0.2, 0.8]. The parameters a, ..., f are randomly chosen from [-1, 1]. We introduced another hidden layer into the LSTM architecture, which receives incoming connections from the first, standard LSTM hidden layer and has outgoing connections to the output and the first hidden layer. The first hidden layer has no output connections. The second hidden layer might serve as a model which is seen by the first hidden layer. The standard LSTM learning algorithm is used after the error is propagated back into the first hidden layer. The LSTM network has a 6/12(1) architecture in the first hidden layer (notation as in Table 1) and 40 units in the second hidden layer (5373 weights). To speed up learning, we first trained on 100 examples per function and then increased this number to 1000. This corresponds to a bias towards fast learning algorithms. The results are listed in Table 2. The authors are not aware of any iterative learning algorithm with performance comparable to that of the derived subordinate method.
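The quadratic target family can be generated as follows. The paper does not specify how the scaling to [0.2, 0.8] is performed; the per-function rescaling over a sample grid in the sketch below is our own assumption, and the clipping guards against the grid only approximating the true extrema.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_quadratic():
    """q(x) = a x1^2 + b x2^2 + c x1 x2 + d x1 + e x2 + f, rescaled to [0.2, 0.8]."""
    a, b, c, d, e, f = rng.uniform(-1.0, 1.0, size=6)

    def raw(x):
        x1, x2 = x
        return a * x1**2 + b * x2**2 + c * x1 * x2 + d * x1 + e * x2 + f

    # estimate the function's range on a sample grid, then rescale (our assumption)
    grid = [(u, v) for u in np.linspace(-1, 1, 21) for v in np.linspace(-1, 1, 21)]
    vals = np.array([raw(x) for x in grid])
    lo, hi = vals.min(), vals.max()
    return lambda x: float(np.clip(0.2 + 0.6 * (raw(x) - lo) / (hi - lo), 0.2, 0.8))

q = random_quadratic()
samples = np.array([q((x1, x2)) for x1, x2 in rng.uniform(-1, 1, size=(100, 2))])
assert samples.min() >= 0.2 and samples.max() <= 0.8
```

Targets in [0.2, 0.8] stay comfortably inside the [0, 1] range of the sigmoid output units described in the experimental setup.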
Table 2. LSTM results for the B14, B16, semilinear ("semil."), and quadratic ("quad.") functions. The columns show: (1) experiment name, (2) number of training sequences ("# functions"), (3) length of training sequences (examples per function, "# examples"), (4) training epochs, (5) training MSEt, (6) test MSEt, (7) training time for the derived algorithm ("train time subordinate"), and (8) maximal training mean squared error per example of the subordinate system after training ("train MSE subordinate"). The B14 architecture was used except for "quad." (see text for details).

experiment | # functions | # examples | train time | train MSEt | test MSEt | train time subordinate | train MSE subordinate
B14        | 128         | 64         | 1000       | 0.033      | 0.038     | 6                      | 0.003
B16        | 256         | 256        | 800        | 0.0055     | 0.0058    | 6                      | 0.002
semil.     | 128         | 64         | 10000      | 0.0007     | 0.0008    | 10                     | 0.07
semil.     | 128         | 1000       | 5000       | 0.0020     | 0.0025    | 50                     | 0.05
quad.      | 128         | 1000       | 25000      | 0.00061    | 0.00068   | 35                     | 0.02

3.4 Summary of Experiments
The experiments demonstrate that our system automatically generates learning methods from scratch and that the derived online learning algorithms are extremely fast. The test and training sequences for the meta-learning system contain rapidly changing dynamics, i.e. the changing functions, which can be viewed as a very non-stationary time series. Our system was able to predict well on previously unseen changing dynamics in the test sequence. The non-stationary time series prediction is based on rapid learning when the dynamics change.
4 Conclusion
Previous approaches to meta-learning are infeasible for large numbers of system parameters. To handle many free parameters, this paper presented the application of gradient descent to meta-learning using recurrent nets. Our theoretical analysis indicated that LSTM is a good meta-learner, which was confirmed in the experiments. With an LSTM net, our system derived a learning algorithm able to approximate any quadratic function after only 35 examples. Our approach requires only a single training sequence; therefore, it may be relevant for lifelong learning and autonomous robots. The meta-learner proposed in this paper performed non-stationary time series prediction. We demonstrated how a machine can derive novel, very fast learning algorithms from scratch.
Acknowledgments The Deutsche Forschungsgemeinschaft supported this work (Ho 1749/11).
Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell
A Variational Approach to Robust Regression

Anita C. Faul and Michael E. Tipping
Microsoft Research, Cambridge, U.K.
Abstract. We consider the problem of regression estimation within a Bayesian framework for models linear in the parameters and where the target variables are contaminated by ‘outliers’. We introduce an explicit distribution to explain outlying observations, and utilise a variational approximation to realise a practical inference strategy.
1 Introduction
We study the problem of interpolating through data where the dependent variables are assumed to be noisy ('regression', 'curve fitting' or 'signal estimation'). Our data comprise pairs {xn, tn}, n = 1, ..., N, and we consider linear interpolation models where the predictor y(x) is expressed as a linearly-weighted sum of a set of M fixed basis functions φm(x), m = 1, ..., M:

    y(x) = ∑_{m=1}^{M} wm φm(x).    (1)
We assume that the 'target' variables deviate from this mapping under some additive noise process, i.e. tn = y(xn) + εn. Typically, we assume independent Gaussian noise: εn ∼ N(0, σ²). The linear nature of this form of model facilitates a Bayesian treatment of the parameters w = (w1, ..., wM)^T, since, if we adopt a Gaussian prior thereover, and with a Gaussian noise model, the posterior and marginal likelihood are both similarly Gaussian. For example, if we choose a prior p(w|A) = N(0, A^{-1}), where the weight prior depends on some hyperparameterised model A, then we can often obtain excellent results utilising the 'type-II maximum likelihood' method: integrating out the weights and optimising the resulting marginal likelihood with respect to A. An excellent article outlining this approach, for the parameterisation A = αI, is provided by [2]. In this paper we instead utilise a prior of the form A = diag(α1, ..., αM), recently introduced to realise the relevance vector machine [4]. A well-known difficulty with a Gaussian noise model is that it is not robust: if the target values are contaminated by outliers, the accuracy of the predictor y(x) can be significantly compromised. In such circumstances, it is common to utilise a noise (ε) distribution with heavier tails, such as a Student-t. However, this prevents the analytic marginalisation over w, limiting the application of the Bayesian framework.
Author to whom correspondence should be directed: [email protected]

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 95–102, 2001.
© Springer-Verlag Berlin Heidelberg 2001
Here, in Section 2, we describe an alternative approach. We define an outlier distribution, p0(t), which is incorporated in a mixture with the standard likelihood model with the express purpose of explaining the outliers in the data. The use of the mixture prevents us from evaluating the marginal likelihood analytically, but we utilise a variational approximation to obtain a lower bound which we can optimise effectively with respect to the hyperparameters. We demonstrate the utility of our method on some test data in Section 3.
2 Detail of the Method

2.1 Bayesian Interpolation
First, we define our data model, p1(t). Conventionally, we assume that the training examples have been generated independently by a Gaussian with mean given by (1) and with standard deviation σ1. So, letting t = (t1, ..., tN)^T and, for convenience, β1 ≡ σ1^{-2}:

    p1(t | w, β1) = ∏_{n=1}^{N} p1(tn | w, β1) = ∏_{n=1}^{N} (β1/2π)^{1/2} exp{−(β1/2)(tn − φn^T w)²},    (2)

where φn = (φ1(xn), ..., φM(xn))^T. In conjunction with this likelihood model, we utilise a Gaussian prior over the parameters:

    p(w | α) = ∏_{m=1}^{M} (αm/2π)^{1/2} exp{−(αm/2) wm²}.    (3)
The use of an individual hyperparameter αm to moderate the contribution of individual basis functions is a form of Bayesian 'automatic relevance determination', as utilised in the 'relevance vector machine' [4]. The reader is referred to the appropriate references for details of the method, but in summary, learning proceeds by integrating out the weights to compute the marginal likelihood:

    p(t | α, β1) = ∫ p1(t | w, β1) p(w | α) dw = N(0, C),    (4)

where C = β1^{-1} I + Φ A^{-1} Φ^T, defining Φ = (φ1, ..., φN)^T. We then maximise (4) over both α and β1. The weight posterior can then be computed using these maximising values of the hyperparameters, and we generally utilise the posterior mean values of the weights to obtain the interpolant. In Figure 1 (left), we show an example interpolant of 100 examples from the univariate function sinc x = sin x / x, where uniform noise of ±0.2 has been added to the targets. For y(x), we utilise N 'Gaussian' basis functions, one located at each training example: φn(x) = exp{−(x − xn)²/r²}, with r = 2 here. We now consider the case of 'outlying' {x, t} pairs, and in Figure 1 (right), we show the resulting interpolant after 25 contaminated examples with t = 0.8 have been added. In common with practically all non-robust approaches, performance deteriorates significantly: the error increases fourfold, and the noise estimate is considerably exaggerated.
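The marginal likelihood (4) can be evaluated directly. A minimal sketch (function and variable names are ours, not taken from the paper) computing ln N(t | 0, C):

```python
import numpy as np

def log_marginal_likelihood(t, Phi, alpha, beta1):
    """Log of eq. (4): ln N(t | 0, C) with C = beta1^{-1} I + Phi A^{-1} Phi^T,
    where A = diag(alpha). Illustrative sketch only."""
    N = len(t)
    # Phi / alpha scales column m by 1/alpha_m, i.e. Phi A^{-1}
    C = np.eye(N) / beta1 + (Phi / alpha) @ Phi.T
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```

Type-II maximum likelihood then amounts to maximising this quantity over α and β1.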
[Figure 1 plot area: left panel 'clean data' (RMS error: 0.0438; Noise: 0.113, Estimate: 0.106); right panel 'contaminated data' (RMS error: 0.1597; Noise: 0.113, Estimate: 0.304); x axis from −10 to 10.]
Fig. 1. Left: relevance vector regression with 100 examples of ‘clean’ data. Right: with 25 added examples with t ﬁxed at 0.8. The data, generated from the ‘sinc’ function shown in grey, is plotted as ﬁlled circles, while the ‘outliers’ are denoted by open circles. The RMS deviation from the true function is given, as well as the ‘true’ and estimated noise standard deviation.
2.2 Explaining Outliers
To alleviate this problem within the presented Bayesian framework, we introduce a second generative model p0(t) of the data that is intended to explain the outlying observations. We consider two alternatives:

A. Where p0(t) is Gaussian centred at y(x), i.e.

    p0(t) = (β0/2π)^{1/2} exp{−(β0/2)(t − φ^T w)²},    (5)

with β0 ≡ σ0^{-2}, where σ0 ≫ σ1 is considered fixed.

B. Where p0(t) is independent of y(x), e.g.

    p0(t) ∼ Uniform[min({tn}), max({tn})].    (6)
The observation density for a single example is then defined as a mixture:

    p(tn) = θ p1(tn) + (1 − θ) p0(tn),    (7)
where θ is the proportion of 'valid' data, which we may attempt to learn. Note that p0(t) in A above thus corresponds to a special case of the Gaussian mixture framework recently analysed by [3] for autoregressive models. Hence, since σ0 ≫ σ1, case A above gives a heavy-tailed generative distribution centred at y(x), and can be considered a very rough approximation to the robust Student-t distribution (which itself is equivalent to a mixture of an infinite number of Gaussians). That p0(t) is Gaussian is necessary for analytic simplicity. Case B is simply a 'general purpose' distribution, and may take any form
providing its parameters remain fixed. The present choice is simply intended as a minimally informative (maximally entropic) explanation of the data. To write down the overall likelihood under this mixture, we introduce hidden variables zn, n = 1, ..., N, which take the value 0 if tn was generated by the outlier distribution, and 1 otherwise. Letting z = (z1, ..., zN)^T, the likelihood is

    p(t | β1, w, z) = ∏_{n=1}^{N} [p0(tn)]^{1−zn} [(β1/2π)^{1/2} exp{−(β1/2)(tn − φn^T w)²}]^{zn}.    (8)

The marginal likelihood p(t | α, β1, θ) we desire is now given by

    p(t | α, β1, θ) = ∫∫ p(t | β1, w, z) p(w | α) P(z | θ) dw dz.    (9)
The probability of the indicator variables z is simply P(z | θ) = ∏_{n=1}^{N} P(zn | θ), where P(zn = 0) = 1 − θ and P(zn = 1) = θ, but in conjunction with (8) and our previous prior (3) over the parameters w, we see that the integration (9) is analytically intractable. We thus utilise a variational procedure which will give us a lower bound on p(t | α, β1, θ), which we will maximise to find α, β1 and θ.

2.3 A Variational Approximation for the Robust Case
We note first that ln p(t) can be expressed as the difference of two terms:

    ln p(t) = ln p(t, w, z) − ln p(w, z | t),    (10)

from which we write

    ln p(t) = ln [p(t, w, z)/Q(w, z)] − ln [p(w, z | t)/Q(w, z)],    (11)

where we have introduced an arbitrary 'approximating' distribution Q(w, z). Integrating both sides of (11) with respect to Q(w, z) gives

    ln p(t) = ∫∫ Q(w, z) ln [p(t, w, z)/Q(w, z)] dw dz − ∫∫ Q(w, z) ln [p(w, z | t)/Q(w, z)] dw dz
            = L[Q(w, z)] + KL[Q(w, z) ‖ p(w, z | t)],    (12)

since Q(w, z) is a distribution and integrates to one. The second term is the Kullback-Leibler divergence between the approximating distribution Q(w, z) and the posterior p(w, z | t). Since KL[Q(w, z) ‖ p(w, z | t)] ≥ 0, it follows that L[Q(w, z)] is a rigorous lower bound on ln p(t). The approach is to maximize L[Q(w, z)] with respect to Q(w, z). Since ln p(t) is independent of Q(w, z), this is equivalent to minimizing the Kullback-Leibler divergence, which is well known to be minimized for Q(w, z) = p(w, z | t). We thus see that the optimal Q(w, z) distribution is given by the true posterior, in which case KL[Q(w, z) ‖ p(w, z | t)] = 0 and the bound becomes exact.
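The decomposition (12) holds for any approximating distribution. A toy numerical check with a single discrete hidden variable (the joint probabilities are arbitrary illustrative values, not from the paper):

```python
import numpy as np

# Toy check of eq. (12): ln p(t) = L[Q] + KL[Q || p(h|t)] for *any* Q(h).
joint = np.array([0.10, 0.25, 0.05])   # p(t, h) for a fixed observed t, h = 0, 1, 2
p_t = joint.sum()                      # p(t), obtained by marginalising h
post = joint / p_t                     # exact posterior p(h | t)
Q = np.array([0.5, 0.3, 0.2])          # an arbitrary approximating distribution
L = np.sum(Q * np.log(joint / Q))      # lower bound L[Q]
KL = np.sum(Q * np.log(Q / post))      # KL[Q || p(h|t)] >= 0
assert np.isclose(L + KL, np.log(p_t)) # the decomposition is exact
assert L <= np.log(p_t)                # L is a rigorous lower bound
```

Setting Q equal to `post` drives KL to zero, at which point the bound is tight.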
While one can adopt some parameterised form for Q(w, z), we note that if we simply assume that w and z are separable, i.e. Q(w, z) = Qw(w) Qz(z), then the KL divergence is minimized by inspection [5]:

    Qw(w) = exp⟨ln p(t, w, z)⟩_{Qz(z)} / ∫ exp⟨ln p(t, w, z)⟩_{Qz(z)} dw    (13)

and

    Qz(z) = exp⟨ln p(t, w, z)⟩_{Qw(w)} / ∫ exp⟨ln p(t, w, z)⟩_{Qw(w)} dz.    (14)
Note that these solutions are mutually dependent, so in practice we must iteratively cycle through them, improving (raising) the lower bound with each such iteration. We now give the expressions for the approximating distributions, the lower bound and the update equations.

The Form of the Q-Distributions. We consider case A first where, for our model, it follows that Qw(w) is Gaussian with covariance matrix

    ΣA = [ ∑_{n=1}^{N} (β0(1 − ⟨zn⟩) + β1⟨zn⟩) φn φn^T + A ]^{-1}    (15)

and mean

    µA = ΣA [ ∑_{n=1}^{N} (β0(1 − ⟨zn⟩) + β1⟨zn⟩) tn φn ],    (16)

where ⟨·⟩ denotes expectation with respect to Qz(z). The distribution Qz(z) is given by the product ∏_{n=1}^{N} Qzn(zn), where

    Qzn(zn) = [β1^{1/2} θ exp(−β1 an/2)]^{zn} [β0^{1/2} (1 − θ) exp(−β0 an/2)]^{1−zn} / [β1^{1/2} θ exp(−β1 an/2) + β0^{1/2} (1 − θ) exp(−β0 an/2)],    (17)

and where

    an = (tn − φn^T µA)² + φn^T ΣA φn.    (18)
For case B the expressions differ since p0(tn) is independent of w:

    ΣB = [ β1 ∑_{n=1}^{N} ⟨zn⟩ φn φn^T + A ]^{-1},    (19)

    µB = ΣB [ β1 ∑_{n=1}^{N} ⟨zn⟩ tn φn ],    (20)

    Qzn(zn) = [θ (β1/2π)^{1/2} exp(−β1 an/2)]^{zn} [(1 − θ) p0(tn)]^{1−zn} / [θ (β1/2π)^{1/2} exp(−β1 an/2) + (1 − θ) p0(tn)].    (21)
The Variational Lower Bound. The lower bound on ln p(t) is then given by

    L[Q(w, z)] = ∫∫ Qw(w) Qz(z) ln p(t, w, z) dw dz − ∫ Qw(w) ln Qw(w) dw − ∫ Qz(z) ln Qz(z) dz    (22)

    = L0 + (1/2) ∑_{n=1}^{N} ⟨zn⟩ (ln β1 − β1 an)
         + (1/2) [ln |AΣ| − tr{A(µµ^T + Σ)}]
         − ∑_{n=1}^{N} ⟨zn⟩ (ln⟨zn⟩ − ln θ)
         − ∑_{n=1}^{N} (1 − ⟨zn⟩) (ln[1 − ⟨zn⟩] − ln[1 − θ]),    (23)

where we have ignored constant terms and where

    L0 = (1/2) ∑_{n=1}^{N} (1 − ⟨zn⟩) (ln β0 − β0 an)   for case A,
    L0 = ∑_{n=1}^{N} (1 − ⟨zn⟩) ln p0(tn)               for case B.    (24)
(Hyper)parameter Update Formulae. There is no obstacle to continuing with the variational formalism and defining prior distributions over α, β1 (see [1]) and θ. Instead, however, for reasons of computational efficiency and based on our experience from [1], we choose to optimise L[Q(w, z)] over those parameters. Differentiating (23) with respect to αm, equating to zero and defining the quantities γm = 1 − αm Σmm (see [2,4] for motivation) leads to the following update:

    αm^new = γm / µm².    (25)

Differentiating with respect to β1 and θ and equating to zero gives the re-estimates

    β1^new = ∑_{n=1}^{N} ⟨zn⟩ / ∑_{n=1}^{N} ⟨zn⟩ an,        θ^new = (1/N) ∑_{n=1}^{N} ⟨zn⟩.    (26)

Algorithm Sketch. The above computations and updates give an algorithm:
1. Initialise α (e.g. αm = M/var[t]), σ1 (e.g. σ1² = var[t]), σ0 (e.g. σ0² = 2 var[t]) and θ (e.g. θ = 0.5).
2. Perform several (e.g. ≈ 5) updates of the approximating distributions Qw(w) and Qz(z) using (15)–(17) or (19)–(21).
3. Re-estimate α, β1 and θ using (25) and (26).
4. Loop back to step 2 until all updates have stabilised.
Note that in step 2, L[Q(w, z)] should never decrease after an individual update, and so the evaluation of (23) serves as a useful check.
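The algorithm sketch above can be gathered into runnable form. The following is a rough illustration of the case-B (uniform p0) scheme under our own naming, initialisation and stability choices (the clip on α, for instance, is ours), not the authors' code:

```python
import numpy as np

def robust_vrvm(x, t, r=2.0, outer=30, inner=5):
    """Illustrative sketch of the case-B variational robust regression scheme."""
    N = len(t)
    # One Gaussian basis function per training input, as in the paper's example
    Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / r ** 2)
    M = Phi.shape[1]
    # Step 1: initialise hyperparameters
    alpha = np.full(M, M / np.var(t))
    beta1 = 1.0 / np.var(t)
    theta = 0.5
    p0 = 1.0 / (t.max() - t.min())      # uniform outlier density, eq. (6)
    z = np.full(N, 0.5)                 # <z_n>: responsibility of the 'valid' model
    for _ in range(outer):
        # Step 2: several updates of Q_w(w) and Q_z(z), eqs. (19)-(21)
        for _ in range(inner):
            Sigma = np.linalg.inv(beta1 * (Phi.T * z) @ Phi + np.diag(alpha))  # (19)
            mu = Sigma @ (beta1 * Phi.T @ (z * t))                             # (20)
            a = (t - Phi @ mu) ** 2 + np.einsum('nj,jk,nk->n', Phi, Sigma, Phi)  # (18)
            num = theta * np.sqrt(beta1 / (2 * np.pi)) * np.exp(-beta1 * a / 2)
            z = num / (num + (1.0 - theta) * p0)                               # (21)
        # Step 3: re-estimate alpha, beta1 and theta, eqs. (25)-(26)
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = np.minimum(gamma / np.maximum(mu ** 2, 1e-12), 1e12)  # clipped for stability
        beta1 = z.sum() / (z * a).sum()
        theta = z.mean()
    return mu, beta1, theta, z
```

The returned ⟨zn⟩ values play the role of the grey-box/diamond flags in Figure 2: points with low ⟨zn⟩ are those the model attributes to the outlier distribution.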
[Figure 2 plot area: panel A 'y-centred outliers' (RMS error: 0.0552) and panel B 'uniform outliers' (RMS error: 0.0517); noise estimates 0.105 and 0.101, estimated outlier proportions 20.8% and 26.2%; x axis from −10 to 10.]
Fig. 2. Interpolation using the y-centred Gaussian model (left) and the uniform-outlier model (right). The data is identical to that of Figure 1. The estimate for θ is given in terms of the proportion of 'outliers' estimated, i.e. 1 − θ. Note that the 'true' value of 1 − θ should be 20%. Data points are overlaid with a grey box if ⟨zn⟩ < 0.5, and with a diamond if 0.5 ≤ ⟨zn⟩ < 0.75.
3 Examples
For our first example of application, we return to the data of Figure 1. Figure 2 shows the results using the two methods presented above. Compared with Figure 1 (right), the case B (uniform) p0(t) model dramatically reduces the error from 0.1597 to 0.0517, which is close to the 'optimal' clean-data result of 0.0438. Note that no additional information was used by the training algorithm in this latter case, since θ was estimated from the data. The improvement results from the fact that p0(t) is a more probable explanation of the outlying data, and so the interpolant (specifically, the weights) does not need to 'fit' those points. We can see in Figure 2 that the y-centred Gaussian p0(t) is slightly inferior to the uniform case, although still much better than Figure 1 (right). This is not unexpected; the uniform p0(t) intuitively seems a better model of the contamination. However, we can consider the case where the noise process on the data is heavy-tailed, and in Figure 3 we utilise a Student-t distribution, with 4 degrees of freedom, to generate εn. Here, although of course only an approximation, the more appropriate y-centred p0(t) performs marginally better than the uniform case, and both are much superior to the non-robust model.
4 Summary
We have detailed a Bayesian interpolation mechanism which is robust to 'outliers', by which we imply the case where either the data is 'contaminated' (Fig. 2) or the noise process is heavy-tailed (Fig. 3). Certainly, the distributions p0(t) that we utilise can only be approximations to the outlier-generating mechanism, but they are nevertheless intended to be sensible, general-purpose choices,
[Figure 3 plot area: panels 'Student-t data' (RMS error: 0.1093), 'A: y-centred outliers' (RMS error: 0.0653) and 'B: uniform outliers' (RMS error: 0.0683); x axis from −10 to 10.]
Fig. 3. Left: interpolation using a standard Gaussian noise model on data generated from the 'sinc' function with additive Student-t noise. Centre: using the y-centred outlier model. Right: using the uniform model.
and we were careful to show only examples where the random samples did not conform to either our assumed noise or outlier distributions. We are currently obtaining results for other realistic contamination scenarios as well as real-world data, and are working to extend the variational procedure to the infinite Gaussian mixture case, corresponding to a Student-t observation model.
References
1. C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 46–53. Morgan Kaufmann, 2000.
2. D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
3. W. Penny and S. J. Roberts. Variational Bayes for non-Gaussian autoregressive models. In Neural Networks for Signal Processing X, pages 135–144. IEEE, 2000.
4. M. E. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 652–658. MIT Press, 2000.
5. S. Waterhouse, D. J. C. MacKay, and T. Robinson. Bayesian methods for mixtures of experts. In M. C. Mozer, D. S. Touretzky, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, 1996.
Minimum-Entropy Data Clustering Using Reversible Jump Markov Chain Monte Carlo

Stephen J. Roberts¹, Christopher Holmes², and Dave Denison²
¹ Robotics Research, Department of Engineering Science, University of Oxford, UK
² Department of Mathematics, Imperial College, London, UK
1 Introduction
Many problems in data analysis, especially in signal and image processing, require the unsupervised partitioning of data into a set of 'self-similar' classes or clusters. An ideal partitioning unambiguously assigns each datum to a single class, and one thinks of the data as being generated by a number of data generators, one for each class. Many algorithms have been proposed for such analysis and for the estimation of the optimal number of partitions. The majority of popular and computationally feasible techniques rely on assuming that classes are hyper-ellipsoidal in shape. In the case of Gaussian mixture modelling [15,6] this is explicit; in the case of dendrogram linkage methods (which typically rely on the L2 norm) it is implicit [9]. For some data sets this leads to over-partitioning. Alternative methods, based for example on valley seeking [6] or maxima-tracking in scale-space [16,18,13], have the advantage that they are free from such assumptions. They can, however, be sensitive to noise and computationally intensive in high-dimensional spaces. In this paper we reconsider the issue of data partitioning from an information-theoretic viewpoint and show that minimisation of partition entropy may be used to evaluate the most probable set of data generators. Rather than formulate the problem as one of traditional model-order estimation to infer the most probable number of classes, we employ a reversible-jump mechanism in a Markov-chain Monte Carlo (MCMC) sampler which explores the space of different model sizes.
2 Theory
In a previous publication [14] we detailed the use of minimum-entropy criteria to construct candidate partition models for data sets. We briefly describe the relevant theory in this section, as well as exploring the issues of Bayesian learning and reversible-jump MCMC methods.

2.1 Partitioning Entropy
Consider a partitioning of the data into a set of k = 1..K classes. The probability density function (pdf) of a single datum x, conditioned on this set of classes, is given by:

    p(x) = ∑_{k=1}^{K} p(x | k) p(k).    (1)

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 103–110, 2001.
© Springer-Verlag Berlin Heidelberg 2001
We consider the overlap between the contribution of the kth class to this density function and the unconditional density p(x). This overlap may be measured by the Kullback-Leibler measure between these two distributions. The latter is defined, for distributions q(x) and p(x), as:

    KL(q(x) ‖ p(x)) ≝ ∫ q(x) ln [q(x)/p(x)] dx.    (2)

Note that this measure reaches a minimum of zero if, and only if, p(x) = q(x). For any other case it is strictly positive and increases as the overlap between the two distributions decreases. What we desire, therefore, is that the KL measure be maximised. We hence write our overlap measure for the kth class as:

    vk = −KL(p(x | k) ‖ p(x)),    (3)

which will be minimised when the kth class is well separated from all others. We define a total overlap, V, as:

    V ≝ ⟨vk⟩ = ∑_{k=1}^{K} p(k) vk;    (4)

hence, combining the above equations,

    V = −∑_{k=1}^{K} p(k) ∫ p(x | k) ln [p(x | k)/p(x)] dx.    (5)
We therefore seek the partitioning for which V is minimal. Using Bayes' theorem we may rewrite Equation 5 such that:

    V = −∑_{k=1}^{K} ∫ p(k | x) ln p(k | x) p(x) dx + ∑_{k=1}^{K} p(k) ln p(k) ∫ p(x | k) dx,    (6)

in which we note that the first term is the expected Shannon entropy of the class posteriors and the second the negative entropy of the class priors (the integral component of the second term is unity).

2.2 Classes as Mixture Models
Consider a partitioning of the data into a set of k = 1...K classes. In the case of the well-known Gaussian mixture model, each class density is taken to be a single Gaussian. This does not offer the degree of flexibility we desire, and we hence model the pdf of the data, conditioned on the kth class, via a semiparametric mixture model with J basis functions. Each mixture basis may be, for example,
a Gaussian, and hence each class in the candidate partitioning of the data is represented as a mixture of these. The class-conditional posteriors may thence be written as a linear combination of the kernel posteriors, i.e. p = WΦ, where p is the set of class posterior probabilities (in vector form), W is a transform matrix (not assumed to be square) and Φ is the set of kernel posteriors (in vector form). Hence the kth class posterior may be written as:

    pk = p(k | x) ≝ ∑_{j=1}^{J} Wk,j φ(j | x).    (7)
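The coupling p = WΦ and the entropy objective of (6) can be illustrated by a minimal sketch (our own function names; the class priors p(k) are estimated here as the sample average of the posteriors, an assumption on our part):

```python
import numpy as np

def class_posteriors(W, kernel_post):
    """Eq. (7): class posteriors as a linear coupling of kernel posteriors.
    kernel_post has shape (N, J): phi(j | x_n); W has shape (K, J) with
    non-negative rows summing to one, so each p(k | x_n) is a probability."""
    assert np.allclose(W.sum(axis=1), 1.0) and (W >= 0).all()
    return kernel_post @ W.T            # shape (N, K): p(k | x_n)

def overlap_V(class_post):
    """Sample estimate of eq. (6): expected posterior entropy minus the
    entropy of the (sample-estimated) class priors."""
    eps = 1e-12
    prior = class_post.mean(axis=0)     # p(k) estimated from the data
    h_post = -np.sum(class_post * np.log(class_post + eps), axis=1).mean()
    h_prior = -np.sum(prior * np.log(prior + eps))
    return h_post - h_prior
```

A crisp, balanced partitioning drives V towards its minimum of −ln K, while fully overlapping classes give V ≈ 0; adapting W is therefore a search for low V.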
If we are to interpret p as a set of class posterior probabilities we require pk ∈ [0, 1] ∀k and ∑_k pk = 1. As φ(j | x) ∈ [0, 1], the first of these conditions is met if each Wk,j ∈ [0, 1]. The second condition is met when each row of W sums to unity. Each class in a candidate partitioning of the data is hence formed from a mixture model whose kernels are fixed but whose coupling matrix, W, is adapted so as to minimise the overall entropy of the partitioning.

2.3 Reversible Jump Sampling
In contrast to traditional model-order determination, the reversible-jump approach, put forward in [7,12], takes an alternative formalism in which the number of model components (classes in this case) is considered as an unknown variable whose posterior density is estimated via a sampling process. Consider a move from W → W′ associated with a change in intrinsic dimension K → K′ (i.e. a change in the hypothesised number of classes in the partitioning). We utilise the results of [7,12,2,8] to determine the acceptance probability of each proposed move. The acceptance probability for a proposed model change (W, K) → (W′, K′) is given via a modified Metropolis-Hastings equation as:

    p(accept) = min{ 1, [p(K′, W′ | Φ) / p(K, W | Φ)] × [q(K, W | Φ) / q(K′, W′ | Φ)] × Jac(W, K → W′, K′) },    (8)

where the first ratio involves the target densities, the second the proposal densities, and Jac denotes the Jacobian of the transformation. Using Bayes' theorem we may expand the terms in the above equation such that:

    p(K, W | Φ) ∝ p(Φ | W, K) p(W | K) p(K)

and

    q(K, W | Φ) ∝ q(W | Φ, K) q(K | Φ).

If we take the prior over the model order, p(K), to be flat between K = 1...J and q(K | Φ) = q(K) = q(K′), then we obtain an acceptance measure

    min{ 1, [p(Φ | W′, K′) p(W′ | K′) / (p(Φ | W, K) p(W | K))] × [q(W | Φ, K) / q(W′ | Φ, K′)] × Jac(W, K → W′, K′) }.    (9)
The likelihood measure, p(Φ | W, K), represents the evidence that the candidate partitioning (uniquely defined via knowledge of W and K) conforms to our objective of minimal entropy. We hence define the log-likelihood measure as the change in entropy given the set of fixed kernel responses, Φ, which from Equation (6) gives:

    log p(Φ | W, K) ≝ H(K) − H(K | W, Φ) = ∆H,    (10)

in which H(K) is the prior entropy of the candidate partitioning and H(K | W, Φ) is the posterior entropy¹. Noting that each row of W corresponds to the mixing fractions which couple Φ to an output class, a convenient form for the prior and proposal densities is a Dirichlet [3], namely:

    q(W | Φ, K) = ∏_{k=1}^{K} DJ(Wk,1:J | γ),        p(W | K) = ∏_{k=1}^{K} DJ(Wk,1:J | α),    (11)
in which Wk,1:J represents the row of W coupling the J kernel responses, Φ, to the kth output class. The hyperparameter sets are defined as γ = γ1J and α = α1J, in which 1J is a vector of J ones.

Sampler Moves. Our sampler has three moves from model (W, K) to (W′, K′):
update: No change in K takes place but the parameters of the existing model are updated.
birth: An increase in model complexity such that the number of classes changes as K′ = K + 1.
death: A decrease in the number of classes such that K′ = K − 1.
At each iteration in the Markov chain one of these three moves is selected with probabilities pu, pb, pd. These sum to unity and are fixed, in our implementation, each at 1/3, save at K = 1 and K = J (the minimum and maximum number of classes) where we set pd and pb to zero respectively. The proposed moves in the Markov chain are applied to a randomly chosen row of W, whose elements we denote as w = Wk*,1:J. The changes in the parameter set associated with each of the three moves are as follows:
update: The update step is made using a hybrid proposal²:
(a) Within-mode 'tweak': w′ ∼ DJ(w′ | βw), i.e. w′ is redrawn around a mean given by w. The hyperparameter β is chosen to be large (see later) so that w′ lies close to w.
¹ Note that maximization of the log-likelihood in Equation (10) is hence equivalent to the minimization of the overlap measure in Equation (6).
² This usage of 'hybrid' is meant in the sense of a multiple-component proposal density [17], rather than in the (later) usage of Neal [11], in which he refers to the use of gradient information in MCMC as hybrid MCMC.
(b) Mode hop: redraw the row from the prior, w′ ∼ DJ(w′ | α).
birth: add a row drawn from the proposal, W′ = W + w′ where w′ ∼ DJ(w′ | γ).
death: remove the row specified by w, i.e. W′ = W − w.
We note that as the birth and death moves are independent of the current state, the Jacobian term in the acceptance probability is unity. Furthermore, as the elements of W remain unchanged other than for those in w, the ratios of densities in Equation 9 cancel for all but the prior and proposal densities over w. Writing the resultant acceptance probability as min(1, r), combining Equations 9, 10 and 11:

    update (a): ru = exp[∆H′ − ∆H] · [DJ(w′ | α) DJ(w | βw′)] / [DJ(w | α) DJ(w′ | βw)]
    update (b): ru = exp[∆H′ − ∆H]
    birth:      rb = exp[∆H′ − ∆H] · DJ(w′ | α) / DJ(w′ | γ)
    death:      rd = exp[∆H′ − ∆H] · DJ(w | γ) / DJ(w | α)

where ∆H and ∆H′ are the log-likelihood measures (Equation 10) of the models specified by (W, K) and (W′, K′) respectively. The Dirichlet density over w with hyperparameters γ = γ1J is

    DJ(w | γ) = [Γ(Jγ) / ∏_{j=1}^{J} Γ(γj)] ∏_{j=1}^{J} wj^{γj − 1},    (12)

and hence, for example, rb is given in full as

    rb = exp[∆H′ − ∆H] · [Γ(Jα)(Γ(γ))^J / (Γ(Jγ)(Γ(α))^J)] ∏_{j=1}^{J} (w′j)^{α − γ}.    (13)

In all the examples given in this paper we set α = 1 (thus giving a flat reference prior), whilst γ and β are chosen to give acceptance rates of around 30% (we find that γ = 1, β = 1000 work well, and these values are used in all examples in this paper). Convergence of the model is rapid, and for the examples given in this paper we only run the chain over some 1500 iterations, the first 1000 of which are used as a burn-in period. On the examples presented here this short sample time is sufficient.
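For illustration, the birth acceptance probability min(1, rb) can be sketched as follows. Function names are ours, and the ∆H values would come from evaluating Equation (10) for the two candidate models; here they are supplied as arguments:

```python
import numpy as np
from math import lgamma

def log_dirichlet(w, c):
    """Log density of a symmetric Dirichlet D_J(w | c 1_J), eq. (12)."""
    J = len(w)
    return lgamma(J * c) - J * lgamma(c) + (c - 1.0) * np.log(w).sum()

def birth_accept_prob(dH_new, dH_old, w_new, alpha=1.0, gamma=1.0):
    """min(1, r_b): the entropy 'likelihood' ratio exp(dH_new - dH_old)
    times the prior/proposal density ratio for the newly proposed row."""
    log_r = (dH_new - dH_old) + log_dirichlet(w_new, alpha) - log_dirichlet(w_new, gamma)
    return min(1.0, np.exp(log_r))
```

With the paper's settings α = γ = 1 the Dirichlet terms cancel and rb reduces to exp[∆H′ − ∆H], i.e. acceptance is governed purely by the change in partition entropy.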
[Figure 1 plot area: panels (a) likelihood L and (b) model order K against iteration (0–1500); panels (c) and (d) scatter plots in the (x1, x2) plane.]
Fig. 1. Wine data set: (a) Likelihood and (b) K during sampling. The ﬁrst 1000 iterations are burn in. (c) True class labels and (d) ﬁnal partitioning results. There are 5 errors in this example.
3 Results

3.1 Wine Recognition Data
As a first example we present results from a wine recognition problem. The data set consists of 178 13-dimensional exemplars which are a set of chemical analyses of three types of wine. We utilise the first three principal components from this data set, fit an initial 20-kernel set using the Expectation-Maximization (EM) algorithm [5,4], and perform a minimum-entropy clustering. Figure 1(a) shows the evolution of the model likelihood in the Markov chain and plot (b) the posterior model order (number of classes). Plot (c) shows the true class partitions (projected onto the first two dimensions of the data) and (d) the resultant partitioning. We note that the few differences between (c) and (d) are due to 'outlying' points from the class distributions and as such cannot be determined correctly by an unsupervised method. In this example there are 5 errors, corresponding to an equivalent classification performance of 97.2%. Supervised analysis has been reported for this data set, and our result is surprisingly good considering that supervised first-nearest-neighbour classification achieves only 96.1%, and multivariate linear-discriminant analysis 98.9% [1].

3.2 Infrared Spectra Data
In the first example we formed a basis set from a Gaussian mixture, so that each class in the resultant partitioning was modelled as a Gaussian mixture. In this example, however, we utilise a highly non-Gaussian basis set formed via a non-negative matrix factorisation of the data [10]. Our resultant class models
Minimum-Entropy Data Clustering
Fig. 2. IR spectra data: (a) Likelihood and (b) K during sampling. The first 1000 samples are burn-in. (c) Example spectra, (d) resultant two-component spectra. The spectra range linearly from 8-13µm [1-44 on the plot].
are thence mixtures of these considerably more complex bases. The data set we consider comes from the Infrared Astronomical Satellite (IRAS), which maps regions of the sky at infrared wavelengths. The data set consists of 531 examples of slightly overlapping blue-band (8-13µm) and red-band (11-22µm) spectra. We consider here only the blue-band data. Figure 2(a) and (b) show the evolution of the model likelihood and order during sampling. Plot (c) shows randomly selected examples from the data set and (d) the most-probable two-class spectral profiles as determined from the K = 2 hypothesis.
4 Conclusions
We have presented a computationally simple technique for data partitioning based on a linear mixing of a set of fixed basis functions. The technique is shown to give good results on a variety of problems. The methodology is general, and non-Gaussian basis kernels may be employed, in which case the estimated class-conditional densities will be mixture models of the chosen basis functions. The method, furthermore, scales favourably with the dimensionality of the data space, and the entropy-minimisation algorithm is efficient even with large numbers of samples.
Acknowledgement Part of this work was funded by the UK Engineering & Physical Sciences Research Council (grant EESD PRO 264) whose support is gratefully acknowledged.
References

1. S. Aeberhard, D. Coomans, and O. de Vel. Comparative Analysis of Statistical Pattern Recognition Methods in High-Dimensional Settings. Pattern Recognition, 27(8):1065–1077, 1994.
2. C. Andrieu, N. de Freitas, and A. Doucet. Sequential MCMC for Bayesian Model Selection. IEEE Signal Processing Workshop on Higher Order Statistics, Caesarea, Israel, June 14–16, 1999.
3. J.M. Bernardo and A.F.M. Smith. Bayesian Theory. John Wiley, 1994.
4. C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
5. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Roy. Stat. Soc., 39(1):1–38, 1977.
6. K. Fukunaga. An Introduction to Statistical Pattern Recognition. Academic Press, 1990.
7. P. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732, 1995.
8. C. Holmes and B.K. Mallick. Bayesian Radial Basis Functions of variable dimension. Neural Computation, 10:1217–1233, 1998.
9. A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
10. D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorisation. Nature, 401:788–791, October 1999.
11. R.M. Neal. Bayesian learning for neural networks. Lecture notes in statistics. Springer, Berlin, 1996.
12. S. Richardson and P.J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society (Series B), 59(4):731–758, 1997.
13. S.J. Roberts. Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition, 30(2):261–272, 1997.
14. S.J. Roberts, R. Everson, and I. Rezek. Maximum Certainty Data Partitioning. Pattern Recognition, 33(5):833–839, 2000.
15. S.J. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian Approaches to Mixture Modelling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1133–1142, 1998.
16. K. Rose, E. Gurewitz, and G.C. Fox. A Deterministic Annealing Approach to Clustering. Pattern Recognition Letters, 11(9):589–594, September 1990.
17. L. Tierney. Markov Chains for Exploring Posterior Distributions. Annals of Statistics, 22:1701–1762, 1994.
18. R. Wilson and M. Spann. A New Approach to Clustering. Pattern Recognition, 23(12):1413–1425, 1990.
Behavioral Market Segmentation of Binary Guest Survey Data with Bagged Clustering

Sara Dolničar¹ and Friedrich Leisch²

¹ Department of Tourism and Leisure Studies, Vienna University of Economics and Business Administration, A-1090 Wien, Austria
[email protected]
² Department of Statistics and Probability Theory, Vienna University of Technology, A-1040 Wien, Austria
[email protected]

Abstract. Binary survey data from the Austrian National Guest Survey conducted in the summer season of 1997 were used to identify behavioral market segments on the basis of vacation activity information. Bagged clustering overcomes a number of difficulties typically encountered when partitioning large binary data sets: the partitions have greater structural stability over repetitions of the algorithm, and the question of the "correct" number of clusters is less important because of the hierarchical step of the cluster analysis. Finally, the bootstrap part of the algorithm provides means for assessing and visualizing segment stability for each input variable.
1 Introduction
The importance of binary data in the social sciences is growing for manifold reasons. Yes-no questions are simpler and faster for respondents to answer. Not only does this increase the chances of respondents finishing a questionnaire and answering it in a concentrated, spontaneous and motivated manner, the binary question format also allows the designer of the questionnaire to pose more questions, as each single answer is less tiring. This is especially important for studies where attitudes towards a multitude of objects are questioned, dramatically increasing the number of answers expected from the respondents, as is typically the case with guest surveys within the field of tourism. These developments lead to an increasing number of medium to large empirical binary data sets available for data analysis.

Turning to the field of market segmentation, empirical binary survey data sets exclude a number of clustering techniques viable for analysis due to their size, which seems to be too large for hierarchical and too small for parametric approaches. Most parametric approaches require very large amounts of data in relation to the number of variables, growing exponentially. For the use of latent class analysis, Formann [4] recommends a sample size of 5 × 2^k, a very strict requirement, especially when item batteries of 20 items are not unusual, as is the case in market segmentation, be it with demographic, socioeconomic, behavioral or psychographic variables. Unless such huge data sets are available, exploratory clustering techniques will broadly be applied to analyze the heterogeneity underlying the population sample. Among the exploratory approaches, the hierarchical clustering techniques require the data sets to be rather small, as all pairwise distances need to be computed in every single step of the analysis. This leaves us with partitioning approaches like learning vector quantization (LVQ) within the family of cluster-analytic techniques. However, partitioning cluster methods typically give less insight into the structure of the data, as the number of clusters has to be specified a priori and solutions for different numbers of clusters can often not be easily compared. Myers & Tauber [9] state in their classic book on market structure analysis that hierarchical clustering better shows how individuals combine in terms of similarities, while partitioning methods produce more homogeneous groups.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 111–118, 2001. © Springer-Verlag Berlin Heidelberg 2001
2 The Bagged Clustering Approach
The central idea of bagged clustering [7,8] is to stabilize partitioning methods like K-means or LVQ by repeatedly running the cluster algorithm and combining the results. Bagging [1], which stands for bootstrap aggregating, has been shown to be a very successful method for enhancing regression and classification algorithms. Bagged clustering applies the main idea of combining several predictors trained on bootstrap sets to the cluster analysis framework. K-means is an unstable method in the sense that in many runs one will not find the global optimum of the error function but only a local optimum. Both initializations and small changes in the training set can have a large influence on the actual local minimum where the algorithm converges. By repeatedly training on new data sets one gets different solutions which should on average be independent of training set influence and random initializations. We can obtain a collection of training sets by sampling from the empirical distribution of the original data, i.e., by bootstrapping. We then run any partitioning cluster algorithm (called the base cluster method below) on each of these training sets. Bagged clustering explores the independent solutions from several runs of the base method using hierarchical clustering. Hence, it can also be seen as an evaluation of the base method by means of the bootstrap. This allows the researcher to identify structurally stable (regions of) centers which are found repeatedly. The algorithm works as follows:

1. Construct B bootstrap training samples X_N^1, ..., X_N^B by drawing with replacement from the original sample X_N.
2. Run the base cluster method (K-means, LVQ, ...) on each set, resulting in B × K centers c_11, c_12, ..., c_1K, c_21, ..., c_BK, where K is the number of centers used in the base method and c_ij is the j-th center found using X_N^i.
3. Combine all centers into a new data set C^B = C^B(K) = {c_11, ..., c_BK}.
4. Run a hierarchical cluster algorithm on C^B (or a pruned version C^B_prune), resulting in the usual dendrogram.
5. Let c(x) ∈ C^B denote the center closest to x. A partition of the original data can now be obtained by cutting the dendrogram at a certain level, resulting in a partition C^B_1, ..., C^B_m, 1 ≤ m ≤ BK, of the set C^B. Each point x ∈ X_N is then assigned to the cluster containing c(x).

The algorithm has been shown to compare favorably to several standard clustering methods on binary and metric benchmark data sets [7]; please see [8] for a detailed analysis and experiments using artificial data with known structure, as space constraints do not allow us to include those results in this paper. The exploratory nature of the approach is especially attractive for practitioners [3].
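Steps 1-5 can be sketched as follows. This is a minimal illustration, not the authors' R implementation: K-means via SciPy stands in for the base method, Ward's linkage is used as in the experiments below, and the final segment count is a parameter chosen when cutting the dendrogram:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def bagged_clustering(X, B=50, K=20, n_segments=4, seed=0):
    rng = np.random.default_rng(seed)
    centers = []
    for _ in range(B):                            # step 1: bootstrap samples
        boot = X[rng.integers(0, len(X), len(X))]
        c, _ = kmeans2(boot, K, minit="++", seed=rng)   # step 2: base method
        centers.append(c)
    C = np.vstack(centers)                        # step 3: pooled center set C^B
    Z = linkage(C, method="ward")                 # step 4: hierarchical clustering
    center_labels = fcluster(Z, t=n_segments, criterion="maxclust")
    nearest = cdist(X, C).argmin(axis=1)          # step 5: closest center c(x)
    return center_labels[nearest]
```

Cutting the dendrogram at a different level only requires re-running `fcluster`, which is why solutions for different numbers of segments remain comparable.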
3 Behavioral Segmentation of Tourist Survey
Our application is the segmentation of tourist surveys for marketing purposes. A data set comprising 5365 respondents and 12 variables was used. The respondents were tourists spending their vacation in the rural area of Austria during the summer season of 1997; city tourists were excluded from the study. These visitors were questioned in the course of the Austrian National Guest Survey. The vacation activities used for behavioral segmentation purposes were:

Activity                 Agreement (%)
cycling                  30.21
swimming                 62.65
going to a spa           14.61
hiking                   75.62
going for walks          93.25
organized excursions     21.62
excursions               77.04
relaxing                 80.17
shopping                 71.50
sightseeing              78.02
museums                  45.09
using health facilities  13.61

The task is to find market segments having homogeneous preferences in some of the activities. In addition to the variables that were used as the segmentation base, a number of demographic, socioeconomic, behavioral and psychographic background variables are available in the extensive guest survey data set: age, daily expenditures per person, monthly disposable income, length of stay, intention to revisit Austria, intention to recommend Austria, number of prior vacations in Austria, etc. These variables were not used as input in the cluster analysis, as only groups homogeneous with respect to vacation activities were of interest. The background variables are only used to describe the market segments in more detail.
[Figure 1 panels: "Cluster: 372 centers, 2147 data points" and "Cluster: 134 centers, 739 data points"; the horizontal axes list the twelve vacation activities.]
Fig. 1. Bagged clustering dendrogram together with boxplots for two selected clusters
The upper part of Figure 1 depicts the dendrogram resulting from a bagged clustering analysis. Learning vector quantization (e.g., [10]) was used as the base method with K = 20 centers in each run on B = 50 training sets. The resulting 1000 centers were then hierarchically clustered using Euclidean distance and Ward's linkage method (e.g., [6]). We also tried other parameter combinations, but results were very similar, as the algorithm is not very sensitive to B and K once these are large enough. Bagged clustering has been implemented using the R environment for statistical computing (http://www.R-project.org), a free implementation of the S language; R functions for bagged clustering can be obtained from the authors upon request. The software allows interactive exploration of the dendrogram: by clicking on a subtree of the dendrogram one gets (in another window) a box-whisker plot of the centers in the corresponding cluster C^B_i. Figure 1 shows as an example the boxplots corresponding to two such subtrees. The boxes range from the 25% quantile to the 75% quantile, the line in the middle represents the median, and the whiskers and circles depict outliers. See the documentation of any modern statistics package for more details on box-whisker plots. The horizontal polygon depicts the overall sample mean such that one can
easily compare which variables are so-called marker variables of the segment, i.e., differ in the segment from the overall population and are repeatedly found having similar values, such that the corresponding boxes are small. The market segments corresponding to the two boxplots in Figure 1 can be described as follows:

– Individual sightseers (left plot): This large segment (40 percent of the tourists questioned) has a clear focus when visiting Austria: they want to hop from sight to sight. Therefore both the items sightseeing and excursions are strongly and commonly agreed upon in this group. Neither sports nor shopping are of central importance, although some members do spend some of their leisure time on those activities. Well reflecting the individualist character of this group is the heterogeneity of this segment concerning a number of activities, e.g. swimming, hiking, shopping or visiting museums.
– Health-oriented holidaymakers (right plot): This niche segment represents a very stable and distinct interest group. Clearly, these tourists spend their vacation swimming and relaxing in spas and health facilities. Also, they all seem to enjoy going for a walk (after the pool is closed?). As far as the remaining activities are concerned, homogeneity decreases, as indicated by the large dispersion of mean values.

The information about which variables "define" a segment (small boxes) and with respect to which variables a segment is heterogeneous (large boxes) is unique to the bagged clustering approach. A small box indicates that the corresponding cluster center was stably found over all repetitions of the base method. Bagged clustering bootstraps the base cluster method, hence the sizes of the boxes visualize the dispersion of the segment mean for each input variable and indicate how "correlated" an input variable is with the segment. This information is not available if the data set is partitioned only once.
Note that we have only {0, 1}-valued data, hence data dispersion within a segment can also not be used. The analysis of the background variables shows that the sightseeing tourists are rather young (median 48 years), very fond of Austria, intend to revisit the country to a high extent, and spend an average amount of money per day (52 Euro). The health-oriented tourists are moderately older (median 53 years) and have a similar intent to revisit the country, however they spend significantly more money per day (68 Euro).
4 Stability Analysis
We have also compared the stability of standard K-means and LVQ with bagged versions thereof. K-means and LVQ were independently repeated 100 times using K = 3 to 10 clusters. Runs where the algorithms converged to local minima (SSE more than 10% larger than the best solution found) were discarded. Then 100 bagged solutions were computed using K = 20 for the base method and B = 50 training sets. The resulting dendrograms were cut into 3 to 10 clusters.
All partitions of each method were compared pairwise using one compliance measure from supervised learning (the Kappa index, [2]) and one compliance measure from unsupervised learning (the corrected Rand index, [5]). Suppose we want to compare two partitions summarized by the contingency table T = [t_ij], where i, j = 1, ..., K and t_ij denotes the number of data points which are in cluster i in the first partition and in cluster j in the second partition. Further, let t_i· and t_·j denote the total number of data points in clusters i and j, respectively:
                 Partition 2
Partition 1    1      2     ...   K
    1         t_11   t_12   ...  t_1K   t_1·
    2         t_21   t_22   ...  t_2K   t_2·
   ...         ...    ...   ...   ...    ...
    K         t_K1   t_K2   ...  t_KK   t_K·
              t_·1   t_·2   ...  t_·K   t_·· = N
In order to compute the Kappa index for an unsupervised classification problem, we first have to match the clusters from the two partitions such that they have maximal agreement. We do this by permuting the columns (or rows) of the matrix T such that the trace \sum_{i=1}^{K} t_{ii} of T becomes maximal. In the following we assume that T has maximal trace. The Kappa index is then defined as

\kappa = \frac{N^{-1}\sum_{i=1}^{K} t_{ii} - N^{-2}\sum_{i=1}^{K} t_{i\cdot} t_{\cdot i}}{1 - N^{-2}\sum_{i=1}^{K} t_{i\cdot} t_{\cdot i}},

which is the agreement between the two partitions corrected for agreement by chance, given the row and column sums. The Rand index measures agreement for unsupervised classifications and hence is invariant with respect to permutations of the columns or rows of T. Let A denote the number of all pairs of data points which are either put into the same cluster by both partitions or put into different clusters by both partitions. Conversely, let D denote the number of all pairs of data points that are put into one cluster in one partition, but into different clusters by the other partition. Hence, the partitions disagree for all pairs in D and agree for all pairs in A, and A + D = \binom{N}{2}. The original Rand index is defined as A / \binom{N}{2}; we use a version corrected for agreement by chance [5], which can be computed directly from T as

\nu = \frac{\sum_{i,j=1}^{K}\binom{t_{ij}}{2} - \left[\sum_{i=1}^{K}\binom{t_{i\cdot}}{2}\sum_{j=1}^{K}\binom{t_{\cdot j}}{2}\right] \big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_{i=1}^{K}\binom{t_{i\cdot}}{2} + \sum_{j=1}^{K}\binom{t_{\cdot j}}{2}\right] - \left[\sum_{i=1}^{K}\binom{t_{i\cdot}}{2}\sum_{j=1}^{K}\binom{t_{\cdot j}}{2}\right] \big/ \binom{N}{2}}.

Figure 2 shows the mean and standard deviation of κ and ν for K = 3, ..., 10 clusters and 100 · 99/2 = 4950 pairwise comparisons for each number of clusters. Bagging considerably increases the mean agreement of the partitions for all numbers of clusters while simultaneously having a smaller variance. Hence, the procedure stabilizes the base method. It can also be seen that LVQ is more stable than K-means on this binary data set.
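Both indices can be computed directly from the contingency table T; a sketch follows. The cluster matching uses the Hungarian algorithm to maximize the trace, which is one concrete way to realize the permutation step (the text does not prescribe a particular matching procedure):

```python
import numpy as np
from math import comb
from scipy.optimize import linear_sum_assignment

def kappa_index(T):
    # Match clusters so that the trace of T is maximal, then apply Cohen's kappa.
    T = np.asarray(T, dtype=float)
    N = T.sum()
    r, c = linear_sum_assignment(-T)              # permutation maximizing sum t_ii
    agree = T[r, c].sum() / N
    chance = (T.sum(axis=1)[r] * T.sum(axis=0)[c]).sum() / N**2
    return (agree - chance) / (1 - chance)

def corrected_rand(T):
    # Hubert & Arabie's chance-corrected Rand index; permutation-invariant.
    T = np.asarray(T, dtype=int)
    N = int(T.sum())
    sum_ij = sum(comb(int(t), 2) for t in T.ravel())
    sum_i = sum(comb(int(t), 2) for t in T.sum(axis=1))
    sum_j = sum(comb(int(t), 2) for t in T.sum(axis=0))
    expected = sum_i * sum_j / comb(N, 2)
    return (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)
```

Identical partitions give κ = ν = 1 regardless of how the cluster labels are permuted, while statistically independent partitions give values near 0.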
Fig. 2. Stability of clustering algorithms over 100 repetitions for 3 to 10 clusters: Mean kappa (top left), mean corrected Rand (top right), standard deviation of kappa (bottom left) and standard deviation of corrected Rand index (bottom right).
5 Summary
The bagged clustering algorithm has been applied to a binary data set from tourism marketing. Categorical data sets with very few categories are very common in the marketing sciences, yet most cluster algorithms are designed for metric input spaces, especially with Gaussian distributions. Hierarchical cluster methods allow for arbitrary distance measures (and hence arbitrary input spaces) but quickly become infeasible with increasing numbers of observations. Bagged clustering overcomes these difficulties by combining hierarchical and partitioning methods. This allows for new exploratory data analysis techniques and cluster visualizations. Clusters can be split into subsegments, and each branch of the tree can be explored and the corresponding market segment identified and described. By bootstrapping partitioning cluster methods we can measure the variance of the cluster centers for each input variable, which is especially important for binary data, where usually only cluster centers without any variance information
are available. This leads to an easy separation of the variables in which a segment is homogeneous from the variables in which a segment is rather heterogeneous. Finally, building complete ensembles of partitions also has a stabilizing effect on the base cluster method. The average agreement between repetitions of the algorithm is considerably increased, while the variance is reduced. The partitions found in two independent runs are more similar to each other, reducing the need for subjective decisions by the practitioner about which solution to choose. Our current work tries to generalize the approach to partitioning methods which are not necessarily represented by centers, e.g., fuzzy clusters. Using distance measures that operate on partitions directly (instead of on representatives), these could then also be clustered using hierarchical techniques.
Acknowledgments This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 (‘Adaptive Information Systems and Modelling in Economics and Management Science’).
References

1. Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
2. J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46, 1960.
3. Sara Dolnicar and Friedrich Leisch. Behavioral market segmentation using the bagged clustering approach based on binary guest survey data: Exploring and visualizing unobserved heterogeneity. Tourism Analysis, 5(2–4):163–170, 2000.
4. Anton K. Formann. Die Latent-Class-Analyse: Einführung in die Theorie und Anwendung. Beltz, Weinheim, 1984.
5. Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
6. Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data. John Wiley & Sons, Inc., New York, USA, 1990.
7. Friedrich Leisch. Ensemble methods for neural clustering and classification. PhD thesis, Institut für Statistik, Wahrscheinlichkeitstheorie und Versicherungsmathematik, Technische Universität Wien, Austria, 1998.
8. Friedrich Leisch. Bagged clustering. Working Paper 51, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", Wirtschaftsuniversität Wien, Austria, August 1999.
9. James H. Myers and Edward Tauber. Market Structure Analysis. American Marketing Association, Chicago, 1977.
10. Brian D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK, 1996.
Direct Estimation of Polynomial Densities in Generalized RBF Networks Using Moments

Evangelos Dermatas

Department of Electrical Engineering and Computer Technology,
[email protected]
[email protected] Abstract. We present a direct estimation method of the output layer weights in a polynomial extension of the generalized radialbasisfunction networks when used in pattern classiﬁcation problems. The estimation is based on the L2 distance minimization of the density and the population moments. Each synaptic weight in the output layer is derived as a nonlinear function of the training data moments. The experimental results, using one and twodimensional simulated data and diﬀerent polynomial orders, show that the classiﬁcation rate of the polynomial densities is very close to the optimum rate.
1 Introduction
Among the single-hidden-layer feedforward neural networks, the most widely used in applications is the radial-basis-function network (RBFN). The RBFN consists of two layers; in the hidden layer, the pattern vector is nonlinearly transformed using a family of radial or kernel functions, and the output layer performs a linear transformation. In a popular generalization of the RBFNs, the generalized RBFNs (GRBFNs) construct hyperellipses around the centers of the basis functions [8]. It has been shown that the GRBFNs are excellent approximators in curve-fitting problems (regression, classification and density estimation) due to their simple network structure and the development of fast learning algorithms [5], based on the K-means algorithm for the estimation of the synaptic weights in the hidden layer. Direct estimation of the weights in the output layer can easily be derived by minimizing the mean square error between the actual and the desired output. Polynomial feedforward neural networks such as Sigma-pi networks [6][7], high-order networks [4], functional link architectures [10], the ridge polynomial networks [9] and the product units [3] are capable of solving complex pattern recognition and regression problems using variants of the backpropagation learning rule. Most of them, even in the case of a single-layer network, construct complex polynomial models which quickly lead to a combinatorial explosion of the input components. This problem, introduced by Richard Bellman [1], appears frequently in the theory of universal approximation and is known as the curse of dimensionality. Only a few pruning strategies have been proposed [6] to adjust the network structure in the training environment and to decrease the number of synaptic weights.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 119–126, 2001. © Springer-Verlag Berlin Heidelberg 2001
Recently, a polynomial extension of the GRBFNs (PeGRBFN) has been used in density estimation and pattern classification problems. The synaptic weights are estimated using a direct method based on moments [2]. The only structural difference from the GRBFN is the inclusion of weighted high-order terms within the stimuli at each output neuron:

O(x, A, M, S) = \sum_{j=1}^{J} \sum_{n=0}^{N_j} a_{jn} \, g^n(x - m_j, s_j),   (1)

where g(x - m_j, s_j) is a radial-basis function, J is the number of hidden neurons, A = {a_{jn}} are the weights of the output neuron connecting the n-th order stimuli of the j-th hidden neuron, and N_j is the number of high-order terms. A computationally simple radial-basis function g : ℝ^M → (0, 1] has been used in [2]:

y_j = g(x - m_j, s_j) = \frac{1}{1 + \sum_{i=1}^{M} s_{ji} (x_i - m_{ji})^2},   (2)

where m_j, s_j are the synaptic weights of the j-th neuron, and M is the dimensionality of the feature vector.

One weakness of the network is the limited descriptive capability of the network transfer function: high-order dependencies among the hidden layer outputs are excluded (eq. 1). Moreover, in the training process the optimization criterion ensures exact matching only between the population and the density moments; the uninvolved moments may therefore differ significantly. This paper presents a PeGRBF network and a direct estimation method for the output layer weights that overcomes the disadvantages of the previous work. More specifically, we study the general type of a full-connected polynomial network, using a novel optimization criterion based on the mean square error of all moment differences between the network pdf approximation and the statistical estimation. We prove that a direct estimation of the synaptic weights can be achieved by solving a system of linear equations, also eliminating the infinite power series. In the last section, two examples in one- and two-dimensional classification problems demonstrate the accuracy of the moment-based training method.
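Equations (1) and (2) can be sketched directly. This is a minimal illustration; the array shapes and the per-neuron coefficient list A are our own conventions, not from the paper:

```python
import numpy as np

def g(x, m, s):
    # Eq. (2): bounded radial-basis kernel, g(x - m, s) in (0, 1]
    return 1.0 / (1.0 + np.sum(s * (x - m) ** 2))

def pegrbfn_output(x, centers, widths, A):
    # Eq. (1): sum over hidden neurons of weighted powers of the kernel output
    out = 0.0
    for m, s, a_j in zip(centers, widths, A):     # a_j = (a_j0, ..., a_jNj)
        y = g(x, m, s)
        out += sum(a * y ** n for n, a in enumerate(a_j))
    return out
```

At x = m_j the kernel attains its maximum value 1 regardless of the widths s_j, which is what makes the power series in the output layer well behaved on (0, 1].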
2 The Full-Connected Polynomial GRBF Network
The full-connected PeGRBFN is an extension of the classic RBF network. In the hidden layer, the feature vector x ∈ ℝ^M is transformed to a vector of directionally weighted distances using a finite set of points m_k ∈ ℝ^M, k = 1, ..., J, defined in the feature space. In the output layer, instead of linear neurons, full-connected polynomials are used. The total transfer function is given by:
O(y, A) = \sum_{n_1=0}^{N} \cdots \sum_{n_J=0}^{N} a_{n_1 \ldots n_J} \prod_{k=1}^{J} g^{n_k}(x - m_k, s_k)   (3)
        = \sum_{n_1=0}^{N} \cdots \sum_{n_J=0}^{N} a_{n_1 \ldots n_J} \prod_{k=1}^{J} y_k^{n_k},   (4)
where y_k = g(x - m_k, s_k) is the k-th neuron output in the hidden layer. The hidden layer outputs y_k are assumed to be bounded in the interval (0, 1], and A = {a_{n_1 ... n_J} : 0 ≤ n_1, ..., n_J ≤ N}. Obviously, the PeGRBFN is a generalization of the GRBFN. If efficient training algorithms can be developed for the PeGRBFN, we expect better approximation capabilities than the GRBFN in density estimation and pattern classification applications. In the rest of this paper we discuss the problem of estimating the weights in the output layer of the PeGRBFNs. The unknown parameters in the hidden layer (m_k, s_k, k = 1, ..., J) can be estimated independently using self-organized clustering methods. Among them, the K-means algorithm is an excellent choice due to its convergence properties.
3 Density Estimation Using Moments
Assume that T examples X = {x_1, x_2, ..., x_T} are available for training and that the hidden layer weights have already been estimated. In this case, the training patterns can be nonlinearly transformed to the corresponding extended patterns at the output of the hidden layer neurons (eq. 3), Y = {y_i : y_i = g(x_i), ∀ x_i ∈ X}. Let equation 3 represent a density of a random vector y with unknown parameters A. If the kernel output is restricted to the interval (0, 1], the (m_1, m_2, ..., m_J) moment of the full-connected polynomial density is

M_p(m_1, m_2, \ldots, m_J) = \int_0^1 \cdots \int_0^1 y_1^{m_1} y_2^{m_2} \cdots y_J^{m_J} \, p_y(y, A) \, dy_1 \cdots dy_J
= \int_0^1 \cdots \int_0^1 \prod_{k=1}^{J} y_k^{m_k} \, O(y, A) \, dy_1 \cdots dy_J   (5)
= \sum_{n_1=0}^{N} \cdots \sum_{n_J=0}^{N} a_{n_1 \ldots n_J} \int_0^1 \cdots \int_0^1 \prod_{k=1}^{J} y_k^{m_k + n_k} \, dy_1 \cdots dy_J
= \sum_{n_1=0}^{N} \cdots \sum_{n_J=0}^{N} a_{n_1 \ldots n_J} \prod_{k=1}^{J} \frac{1}{m_k + n_k + 1}.   (6)
The population moments for the set of training examples Y are given by:

M_s(m_1, m_2, \ldots, m_J, Y) = \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{J} y_{tk}^{m_k}.

The L2 distance between the population moments and the network density moments, taken over all moments, is:

E(A, Y) = \sum_{m_1=0}^{\infty} \cdots \sum_{m_J=0}^{\infty} \left( M_p(m_1, \ldots, m_J) - M_s(m_1, \ldots, m_J, Y) \right)^2   (7)
= \sum_{m_1=0}^{\infty} \cdots \sum_{m_J=0}^{\infty} \left( \sum_{n_1=0}^{N} \cdots \sum_{n_J=0}^{N} a_{n_1 \ldots n_J} \prod_{k=1}^{J} \frac{1}{m_k + n_k + 1} - \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{J} y_{tk}^{m_k} \right)^2.
The unique minimum of the quadratic error function is reached by solving the equations

\frac{\partial E(A, Y)}{\partial a_{q_1 \ldots q_J}} = 0,

i.e.,

\sum_{m_1=0}^{\infty} \cdots \sum_{m_J=0}^{\infty} \left( \sum_{n_1=0}^{N} \cdots \sum_{n_J=0}^{N} a_{n_1 \ldots n_J} \prod_{k=1}^{J} \frac{1}{m_k + n_k + 1} - \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{J} y_{tk}^{m_k} \right) \prod_{k=1}^{J} \frac{1}{m_k + q_k + 1} = 0.

Therefore, a total number of (N + 1)^J linear equations is derived:

\sum_{n_1=0}^{N} \cdots \sum_{n_J=0}^{N} a_{n_1 \ldots n_J} \sum_{m_1=0}^{\infty} \cdots \sum_{m_J=0}^{\infty} \prod_{k=1}^{J} \frac{1}{m_k + n_k + 1} \cdot \frac{1}{m_k + q_k + 1} = \frac{1}{T} \sum_{t=1}^{T} \sum_{m_1=0}^{\infty} \cdots \sum_{m_J=0}^{\infty} \prod_{k=1}^{J} \frac{y_{tk}^{m_k}}{m_k + q_k + 1},   (8)

for 0 ≤ q_1, ..., q_J ≤ N.
In a more compact form, the direct solution is given by

\sum_{n_1=0}^{N} \cdots \sum_{n_J=0}^{N} a_{n_1 \ldots n_J} F(P_n, P_q, J) = U(P_q, J, Y) \;\Leftrightarrow\; FA = U \;\Leftrightarrow\; A = F^{-1} U,

where:

F(P_n, P_q, J) = \sum_{m_1=0}^{\infty} \cdots \sum_{m_J=0}^{\infty} \prod_{k=1}^{J} \frac{1}{m_k + n_k + 1} \cdot \frac{1}{m_k + q_k + 1},   (9)

U(P_q, J, Y) = \frac{1}{T} \sum_{t=1}^{T} \sum_{m_1=0}^{\infty} \cdots \sum_{m_J=0}^{\infty} \prod_{k=1}^{J} \frac{y_{tk}^{m_k}}{m_k + q_k + 1},   (10)
Direct Estimation of Polynomial Densities
Table 1. The first 12 values of the function f(n, n)

n   f(n, n)        n   f(n, n)        n    f(n, n)
0   1.644934066    4   0.221322955    8    0.117512014
1   0.644934066    5   0.181322955    9    0.105166335
2   0.394934066    6   0.153545177    10   0.095166336
3   0.283822955    7   0.133137014    11   0.086901873
and P_n, P_q are sequences of nonnegative integers: P_n = (n_1, n_2, ..., n_J : n_i ∈ ℕ ∪ {0}, 0 ≤ n_i ≤ N, i = 1, ..., J) and P_q = (q_1, q_2, ..., q_J : q_i ∈ ℕ ∪ {0}, 0 ≤ q_i ≤ N, i = 1, ..., J). The infinite series in equations 9 and 10 can be eliminated, which also decreases the computational complexity. First, we prove that the function F(P_n, P_q, J) can be factorized:

$$F(P_n, P_q, J) = \sum_{m_1=0}^{+\infty} \cdots \sum_{m_J=0}^{+\infty} \prod_{k=1}^{J} \frac{1}{(m_k+n_k+1)(m_k+q_k+1)} \qquad (11)$$

$$= \prod_{k=1}^{J} \sum_{m_k=0}^{+\infty} \frac{1}{(m_k+n_k+1)(m_k+q_k+1)} = \prod_{k=1}^{J} f(n_k, q_k). \qquad (12)$$

The function F(P_n, P_q, J) is symmetric:

$$F(P_n, P_q, J) = \prod_{k=1}^{J} f(n_k, q_k) = \prod_{k=1}^{J} f(q_k, n_k) = F(P_q, P_n, J).$$

Furthermore, the infinite series in f(n, q) reduces as follows:

$$f(n, q) = \sum_{m=0}^{+\infty} \frac{1}{(m+n+1)(m+q+1)} = \begin{cases} \dfrac{1}{|n-q|} \displaystyle\sum_{m=\min(n,q)+1}^{\max(n,q)} \dfrac{1}{m}, & n \neq q, \\[6pt] \displaystyle\sum_{i=1}^{+\infty} \dfrac{1}{(i+n)^2}, & n = q. \end{cases}$$

The series $\sum_{i=1}^{+\infty} 1/(i+n)^2$ converges for all n ∈ ℕ ∪ {0}. Table 1 gives the function values most useful in applications. In a step that is critical from a practical point of view, the infinite series involved in U(P_q, J, Y) is transformed into a finite power series plus a natural-logarithm function of the training dataset Y,
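The reduction of f(n, q) is easy to check numerically. The following sketch is our own illustration, not code from the paper, and the function names are ours; it compares a truncated version of the series with the closed form:

```python
import math

def f_series(n, q, terms=200000):
    """Truncated version of f(n,q) = sum_{m>=0} 1/((m+n+1)(m+q+1))."""
    return sum(1.0 / ((m + n + 1) * (m + q + 1)) for m in range(terms))

def f_closed(n, q):
    """Closed form: a finite harmonic sum for n != q, pi^2/6 minus a finite sum for n == q."""
    if n != q:
        lo, hi = min(n, q), max(n, q)
        return sum(1.0 / m for m in range(lo + 1, hi + 1)) / (hi - lo)
    return math.pi ** 2 / 6 - sum(1.0 / i ** 2 for i in range(1, n + 1))

print(round(f_closed(0, 0), 9))  # → 1.644934067 (pi^2/6, cf. Table 1)
print(f_closed(0, 1))            # → 1.0 (the series telescopes)
```

The n = q branch reproduces the values of Table 1 directly.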
$$U(P_q, J, Y) = \frac{1}{T} \sum_{t=1}^{T} \sum_{m_1=0}^{+\infty} \cdots \sum_{m_J=0}^{+\infty} \prod_{k=1}^{J} \frac{y_{tk}^{m_k}}{m_k+q_k+1} = \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{J} \sum_{i=0}^{+\infty} \frac{y_{tk}^{i}}{i+q_k+1}$$

$$= \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{J} y_{tk}^{-q_k-1} \sum_{i=0}^{+\infty} \frac{y_{tk}^{i+q_k+1}}{i+q_k+1} = \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{J} y_{tk}^{-q_k-1} \Big( \sum_{i=1}^{+\infty} \frac{y_{tk}^{i}}{i} - \sum_{i=1}^{q_k} \frac{y_{tk}^{i}}{i} \Big) \qquad (13)$$

$$U(P_q, J, Y) = \frac{(-1)^J}{T} \sum_{t=1}^{T} \prod_{k=1}^{J} \Big( y_{tk}^{-q_k-1} \ln(1 - y_{tk}) + \sum_{i=1}^{q_k} \frac{y_{tk}^{i-q_k-1}}{i} \Big). \qquad (14)$$
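Equation (14) can likewise be checked numerically in the one-dimensional case J = 1. The sketch below is ours, with illustrative names; it compares the truncated series of eq. 10 with the logarithm-plus-finite-sum form:

```python
import math

def U_direct(q, Y, terms=5000):
    """Truncated series of eq. 10 for J = 1: (1/T) sum_t sum_i y_t^i / (i+q+1)."""
    return sum(sum(y ** i / (i + q + 1) for i in range(terms)) for y in Y) / len(Y)

def U_closed(q, Y):
    """Eq. 14 for J = 1: the series replaced by a logarithm and a finite power series."""
    total = 0.0
    for y in Y:
        total -= (y ** (-q - 1) * math.log(1.0 - y)
                  + sum(y ** (i - q - 1) / i for i in range(1, q + 1)))
    return total / len(Y)

Y = [0.2, 0.5, 0.9]   # kernel outputs, restricted to (0, 1)
for q in range(4):
    assert abs(U_direct(q, Y) - U_closed(q, Y)) < 1e-9
print("closed form matches the series for q = 0..3")
```

Note that the logarithmic form requires y < 1; at y = 1 the underlying series itself diverges.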
For the fully connected PeGRBFN, the matrix F depends on the order of the polynomials and the number of neurons in the hidden layer. It remains to be proved that the inverse of the matrix F always exists; numerical experiments showed that F is invertible for typical network structures. The inversion process is independent of the training data and can therefore be carried out offline. Thus, the computational complexity of the proposed training method is $O(\max((N+1)^{2J}, T(N+1)^J))$.
4 Examples in Pattern Classification Problems
The proposed density estimation method is evaluated in a simulation study based on Gaussian-distributed one- and two-dimensional examples. The first dataset consists of one-dimensional examples produced by two Gaussian generators, with mean values 0.3 and 0.7 respectively and standard deviation 0.1. For one-dimensional data, the constant parameters of the system of linear equations (eqs. 9 and 10) reduce to

$$F(n, q, 1) = f(n, q), \qquad q = 0, \dots, N, \qquad (15)$$

and, if $\langle \cdot \rangle_t$ denotes the expectation over the training examples,

$$U(q, 1, Y) = \Big\langle -y_t^{-q-1} \ln(1 - y_t) - \sum_{i=1}^{q} \frac{1}{i}\, y_t^{i-q-1} \Big\rangle_t, \qquad q = 0, \dots, N. \qquad (16)$$
The results of the two classification experiments are shown in table 2. In the first experiment, the probabilistically optimal threshold of 0.5 (assuming equal a priori probability for each class) was used to classify the data. In the second experiment, the examples were classified according to the polynomial approximation; a polynomial density of the same order was estimated from the training data of each individual class. Four data sizes of 200, 500, 1000 and 2000 examples were tested, and the polynomial order varied from 2 to 6. Figure 1 shows the Gaussian generators' pdf and the density approximation for second-, third- and fourth-order polynomials using the set of 2000
Fig. 1. Gaussian pdf and second-, third- and fourth-order polynomial densities

Table 2. Percent classification rate on one-dimensional data using the Gaussian pdfs and the polynomial approximation

Data size  Gaussian    2     3     4     5     6
200        96.0      81.0  94.0  94.0  97.0  97.0
500        98.6      94.0  95.8  95.8  97.0  86.6
1000       98.1      81.4  94.6  91.1  91.1  90.6
2000       97.5      81.1  95.7  96.4  96.8  55.8
Table 3. Percent classification rate on two-dimensional data using the Gaussian pdfs and the polynomial approximation with order-2 up to order-6 terms

Data size  Gaussian    2      3      4      5      6
200        87.00     87.00  91.50  60.00  48.00  58.00
500        89.80     89.80  87.00  87.60  57.80  48.20
1000       91.20     91.20  83.40  77.00  43.10  52.20
2000       90.80     84.85  89.25  89.50  48.50  43.90
examples. The accuracy of the proposed weight estimation method is very sensitive to the number of training data and to the order selected for the polynomial approximation, as shown by the experimental results presented in table 2. An appropriate polynomial order gives an excellent classification rate, very close to the optimal rate estimated using the minimum-error criterion of the Gaussian classifier. In the second experiment, the polynomial density approximation is evaluated in a two-class, two-dimensional pattern classification problem. The same data sizes were tested using Gaussian pdf generators, (x1, x2) ∼ (N(0.3, 0.2), N(0.3, 0.2)) and (x1, x2) ∼ (N(0.7, 0.2), N(0.7, 0.2)). Table 3 shows, for the four data sets of 200, 500, 1000 and 2000 examples, the classification rate of the maximum-probability criterion using the Gaussian pdfs and the rate obtained by the maximum-polynomial-density rule. As in the one-dimensional experiment, the polynomial order varies from 2 to 6. The number of unknown weights increases significantly for high-order terms and fully connected polynomial terms (the curse-of-dimensionality problem). Consequently, the size of the matrix F increases, and numerical restrictions in the matrix inversion process significantly decrease the accuracy of the density approximation: the maximum classification accuracy on two-dimensional data is 58% when the density is approximated with a polynomial order greater than 10, against the optimum of 87%.
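The direct solution A = F⁻¹U can be exercised end to end on a case with a known answer. In the sketch below (our own illustration, not the paper's experiment) the extended patterns are taken to be uniform on (0, 1), whose exact density is the polynomial p(y) = 1, i.e. a = (1, 0, 0) for N = 2. Feeding the exact uniform moments makes U_q = f(0, q), and solving the 3×3 system recovers those coefficients; the small Gaussian-elimination helper is ours:

```python
import math

def f(n, q):
    """f(n,q) = sum_m 1/((m+n+1)(m+q+1)) via the closed form derived above."""
    if n != q:
        lo, hi = min(n, q), max(n, q)
        return sum(1.0 / m for m in range(lo + 1, hi + 1)) / (hi - lo)
    return math.pi ** 2 / 6 - sum(1.0 / i ** 2 for i in range(1, n + 1))

def solve(M, b):
    """Plain Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            k = A[r][col] / A[col][col]
            A[r] = [a - k * p for a, p in zip(A[r], A[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

N = 2
F = [[f(n, q) for n in range(N + 1)] for q in range(N + 1)]
U = [f(0, q) for q in range(N + 1)]   # exact U_q for the uniform density
A = solve(F, U)
for a, expected in zip(A, [1.0, 0.0, 0.0]):
    assert abs(a - expected) < 1e-6
print("recovered polynomial coefficients a =", [round(v, 9) for v in A])
```

With sampled rather than exact moments the recovered coefficients are noisier, since F is ill-conditioned even for N = 2, which is consistent with the accuracy losses reported above for high polynomial orders.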
References

1. Bellman, R.: Adaptive Control Processes: A Guided Tour. Princeton University Press (1961)
2. Dermatas, E.: Polynomial Extension of the Generalized Radial-Basis Function Networks in Density Estimation and Classification Problems. Neural Network World 10(4) (2000) 553–564
3. Durbin, R., Rumelhart, D.: Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks. Neural Computation 1 (1989) 133–142
4. Giles, C., Maxwell, T.: Learning, Invariance, and Generalization in a High-Order Neural Network. Applied Optics 26(23) (1987) 4972–4978
5. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall International (1999) 298–305
6. Heywood, M., Noakes, P.: A Framework for Improved Training of Sigma-Pi Networks. IEEE Tr. on Neural Networks 6(4) (1995) 893–903
7. McClelland, J., Rumelhart, D.: Parallel Distributed Processing 1. MIT Press (1987)
8. Poggio, T., Girosi, F.: Networks for Approximation and Learning. Proc. IEEE 78(9) (1990) 1481–1497
9. Shin, Y., Ghosh, J.: Ridge Polynomial Networks. IEEE Tr. on Neural Networks 6(3) (1995) 610–622
10. Pao, Y.: Adaptive Pattern Recognition and Neural Networks. Reading, MA: Addison-Wesley (1989)
Generalisation Improvement of Radial Basis Function Networks Based on Qualitative Input Conditioning for Financial Credit Risk Prediction

Xavier Parra¹, Núria Agell² and Xari Rovira²

¹ ESAII, Technical University of Catalonia, Av. Víctor Balaguer, s/n, 08800 Vilanova i la Geltrú, Barcelona, Catalonia, Spain
[email protected]
² ESADE, Ramon Llull University, Av. Pedralbes, 62, 08034 Barcelona, Catalonia, Spain
{agell, rovira}@esade.edu
Abstract. A rating is a qualified assessment of the credit risk of bonds issued by a government or a company. Specialised rating agencies classify firms according to their level of risk, using both quantitative and qualitative information to assign ratings to issues. The final rating is the judgement of the agency's analysts and reflects the probability of issuer default. Since the final rating depends strongly on expert knowledge, the application of learning-based techniques to acquire that knowledge seems reasonable. The learning technique applied is a neural network, and the architecture used corresponds to radial basis function networks. A convenient adaptation of the variables involved in the problem is strongly recommended when using learning techniques. The paper aims at conditioning the input information in order to enhance the neural network's generalisation by adding qualitative expert information on orders of magnitude. An example of this method applied to some industrial firms is given.
1 Introduction
The present paper aims at applying neural network techniques to predict the credit rating of bonds issued by a government or a company. Predicting the rating of a firm therefore requires a thorough knowledge of the ratios and values that indicate the firm's situation, and also a thorough understanding of the relationships between them and the main factors that can alter these values [1]. The application of learning-based techniques to acquire the analyst's knowledge seems reasonable given the special nature of the problem. The strong dependency on the expert's knowledge is the main reason that led us to the connectionist approach proposed here. However, how to represent the input and output variables of a learning problem in a neural network implementation is one of the key decisions influencing the quality of the solutions one can obtain. This is especially important when qualitative information is available during training.
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 127–134, 2001. © Springer-Verlag Berlin Heidelberg 2001
Xavier Parra, Núria Agell, and Xari Rovira
This application shows how neural network and qualitative reasoning techniques, and particularly orders-of-magnitude calculus [2], can be useful in the financial domain [3]. The paper gives a brief introduction to the neural network architecture used. Next, the process for preparing the reference scales of the different qualitative variables is established. In section 5 an application of this neural network to credit risk evaluation of a firm or an issue of bonds is presented. The paper finishes with some conclusions, together with comments about the implementation of this application and the first results obtained with it.
2 Radial Basis Function Networks Architecture
Radial basis function networks (RBF) are especially interesting for the problem proposed since they are universal classifiers [4]. RBF networks have traditionally been associated with a simple three-layer architecture [5] (see Fig. 1). Each layer is fully connected to the following one, and the hidden layer is composed of a number of nodes with radial activation functions called radial basis functions. Each of the input components feeds forward to the radial functions. The outputs of these functions are linearly combined with weights into the network output. Each radial function has a local response (as opposed to the global response of a sigmoid function), since its output only depends on the distance of the input from a centre point.
Fig. 1. Radial basis function network architecture
Radial functions in the hidden layer have a structure that can be represented as follows:

$$\phi_i(x) = \varphi\big((x - c_i)^T R^{-1} (x - c_i)\big) \qquad (1)$$

where φ is the radial function used, {c_i | i = 1, 2, ..., c} is the set of radial function centres, and R is a metric. The term (x − c_i)^T R⁻¹ (x − c_i) denotes the (squared) distance from the input x to the centre c_i in the metric defined by R. There are several common types of
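As a concrete reading of eq. (1) and Fig. 1, the sketch below (our own illustration) implements the forward pass with Gaussian radial functions and the particular metric R = r²I, so the term reduces to a scaled squared Euclidean distance; all centres, weights and values are made up:

```python
import math

def rbf_output(x, centres, weights, w0, r):
    """F(x) = w0 + sum_i w_i * phi_i(x), with Gaussian phi and metric R = r^2 * I."""
    out = w0
    for c, w in zip(centres, weights):
        d2 = sum((xj - cj) ** 2 for xj, cj in zip(x, c)) / r ** 2  # (x-c)^T R^-1 (x-c)
        out += w * math.exp(-0.5 * d2)
    return out

centres = [[0.0, 0.0], [1.0, 1.0]]   # the c_i (illustrative values)
weights = [0.7, -0.3]                # the linear output weights w_i
print(rbf_output([0.0, 0.0], centres, weights, w0=0.1, r=1.0))
# at the first centre, phi_1 = 1 and phi_2 = exp(-1), so the output is 0.1 + 0.7 - 0.3*exp(-1)
```

With this choice of R each basis function responds locally: its output decays with the distance of x from its centre, in contrast to a sigmoid unit.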
4 A Credit Risk Prediction: Rating Evaluation
Our objective in this application is twofold: on the one hand, the goal is to replicate the rating evaluation of the agencies by using radial basis function neural networks; on the other hand, to test whether this evaluation improves when some qualitative information is available. The rating agency Standard & Poor's uses the following labels to assign a rating to firms: {AAA, AA, A, BBB, BB, B, CCC, CC, C}. From left to right these rankings go from high to low credit quality, i.e., from high to low capacity of the firm to repay debt; they thus permit classifying firms according to their level of risk. The processes employed by Standard & Poor's analysts to assign these labels are highly complex. The decision on the final rating is based on the information given by the financial data, the industry and the country where the firm operates, forecasts of the firm's growth possibilities, and its competitive position. As a result, the agency provides a global evaluation based on its own expertise to determine the rating. To this end we use an initial database that includes some of the firms in the Dow Jones 500 index. For each firm we have financial ratios as quantitative variables and the sector as a qualitative one. As can be seen in the next table, the firms are grouped into seven different sectors.

Table 1. Sectors

Sector    Cyclical consumer  Non-cyclical consumer  Technology  Utilities  Basic Materials  Industrial  Energy
Number    1                  2                      3           4          6                7           8
Firms     71                 80                     42          38         33               58          31
The quantitative variables used are: interest coverage (IC); market value over debt (MV/DBT), to gauge the indebtedness of the firm; debt over assets (DBT/ATN), as a measure of leverage; cash flow over debt (CF/DBT), which indicates how many years the company will need to repay its debt; return on assets (ROA); self-financing percentage (SELFIN), calculated as profit over assets; short-term over long-term debt (DC/DL); and sales growth (SALES). The experts agree that some of these variables are strongly dependent. For that reason, and after a statistical study, they were reduced to the following five: V1 = IC, V2 = MV/DEBT, V3 = ROA, V4 = SELFIN, V5 = DC/DL. Each variable has different landmarks, as can be seen in table 3, according to the experts' knowledge.
5 Experiments and Results
Initially, we started with a database that included a total of 353 patterns. For each pattern there were 12 quantitative input variables, 1 qualitative variable (the sector) and 1 output. Since many instances had missing values, all instances with one or more missing values were deleted from the database. Following the experts' recommendations, and due to the special peculiarity of one sector of activity (the technological sector), the technological companies were also deleted from the set of
patterns. The next step was, following the experts' knowledge, to select the variables most relevant in computing credit risk. The input space was reduced from 12 to 5 variables, and from 495 to 244 instances. All 5 input variables are real-valued, while the rating, i.e. the output variable, is a nominal variable with 6 different classes {AAA, AA, A, BBB, BB, B}, represented using a 1-of-6 code. At this point there were at least two options: to train a single RBF network with 5 inputs and 6 outputs, or to train 6 different RBF networks, each with 5 inputs and only 1 output. The former option is not very appropriate because of the low number of patterns available for training. The latter option is more efficient from the point of view of resource optimisation. Although the final architecture will probably be smaller for the single RBF network and its training faster, its generalisation will be worse, and good generalisation matters more than size or training time. Thus, the experiments were performed considering that the initial problem of classifying a pattern into 1 of 6 classes has been transformed into 6 different problems of classifying a pattern into a single class. Each network says whether or not the pattern belongs to the class for which it has been trained. Simulations have been carried out following the PROBEN1 standard rules [7]. The available data set was sorted by company name before partitioning it into three subsets: training, validation and test, in proportions of 50%, 25% and 25% respectively. Table 2 shows the pattern distribution in each data subset. Note that for class AAA there are no patterns available in the test subset, and for class B there are no patterns for training or validation.
Table 2. Pattern distribution over data subsets

Rating      AAA  AA   A    BBB  BB  B  Total
Training    5    18   53   41   5   0  122
Validation  2    10   28   18   3   0  61
Test        0    7    23   27   3   1  61
Total       7    35   104  86   11  1  244
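The reformulation of the 1-of-6 problem as six binary problems, one network per rating class, can be sketched as follows (our own illustration; the paper's networks are RBFs, and only the target encoding is shown here):

```python
RATINGS = ["AAA", "AA", "A", "BBB", "BB", "B"]

def one_of_n(label):
    """1-of-6 target vector for a single multi-output network."""
    return [1 if r == label else 0 for r in RATINGS]

def one_vs_rest(labels, target_class):
    """Binary targets for the network specialised on one rating class."""
    return [1 if lab == target_class else 0 for lab in labels]

labels = ["AA", "BBB", "A", "AA"]          # hypothetical training labels
print(one_of_n("BBB"))                     # → [0, 0, 0, 1, 0, 0]
print(one_vs_rest(labels, "AA"))           # → [1, 0, 0, 1]
```

Each of the six networks is then trained against its own binary target vector rather than the shared 1-of-6 code.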
To study and analyse the effect that qualitative input conditioning has on RBF generalisation, two different kinds of training have been performed. The first (referred to as blind training) rescales all the input values to mean 0 and standard deviation 1, without taking the experts' knowledge into account. The second (expert training) performs the input transformation described in section 3, i.e. it uses the information on orders of magnitude to rescale the values (see Table 3).

Table 3. Expert landmarks and signs

     ll     li     lr     s
V1   1      4      10     +1
V2   1      2      8      +1
V3   0.02   0.07   0.15   +1
V4   0.2    1      10     −1
V5   0.0    0.1    0.3    +1
Initially, networks are trained on the training set while the validation set is used to adjust the radial function width (r). To perform this adjustment, a total of 4000 simulations have been done for each class. The widths checked go from 0.0001 to 0.1 in increments of 0.0001, from 0.101 to 1.1 in increments of 0.001, from 1.11 to 11.1 in increments of 0.01, and from 11.2 to 111.1 in increments of 0.1. The final width (see Table 4) is selected among the 4000 widths trained by applying the following criteria: (a) choose the width that maximises classification accuracy for the validation set; (b) among those satisfying (a), choose the width that produces the smallest network; (c) among those satisfying (b), choose the width that minimises the mean squared error for the validation set; (d) among those satisfying (c), choose the width that minimises the mean squared error for the training set; (e) among those satisfying (d), choose the width that maximises classification accuracy for the training set.

Table 4. Radial function width (r) and classification accuracy for the validation data set (CAva) and test data set (CAte)
           Blind training              Expert training
       r        CAva    CAte       r       CAva    CAte
AAA    1.055    96.7%   100.0%    51.2     98.4%   100.0%
AA     11.2     82.0%   88.5%     68.1     85.2%   90.2%
A      0.831    73.7%   50.8%     4.7      70.5%   59.0%
BBB    80.6     63.9%   57.4%     32.5     75.4%   59.0%
BB     5.11     95.1%   95.1%     19.9     95.1%   95.1%
B      111.1    100.0%  98.4%     111.1    100.0%  98.4%
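Criteria (a)-(e) amount to a single lexicographic preference over the candidate widths, which can be expressed compactly (our own sketch, with hypothetical per-width statistics):

```python
# One candidate per trained width: (width, validation accuracy, network size,
# validation MSE, training MSE, training accuracy). Values are hypothetical.
candidates = [
    (0.831, 0.737, 40, 0.12, 0.10, 0.80),
    (5.110, 0.951, 25, 0.05, 0.06, 0.95),
    (11.20, 0.951, 30, 0.04, 0.05, 0.96),
]

def select_width(candidates):
    """Criteria (a)-(e) as one lexicographic key: maximise validation accuracy,
    then minimise network size, validation MSE and training MSE, then maximise
    training accuracy."""
    return max(candidates, key=lambda c: (c[1], -c[2], -c[3], -c[4], c[5]))[0]

print(select_width(candidates))  # → 5.11 (the accuracy tie is broken by the smaller network)
```

Tuple comparison in Python applies the criteria in exactly this priority order.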
Once the radial width is determined, networks are trained on the training and validation sets, while the test set is used to assess the generalisation ability of the final solution. As can be seen in Table 4, classification accuracy for the expert training is better than, or at least equal to, that for the blind training. Since the only difference is the use of the expert landmarks during input conditioning, it appears that this kind of information can be useful during training. However, the initial problem was not to make six independent classifications, but just one. Each of the six trained RBF networks says whether or not a pattern belongs to the class for which the network was trained; its output is therefore either "yes, it is" or "no, it is not". Unfortunately, when the classifications from the six networks are combined, the answer is not necessarily exactly one of the six classes: it could be more than one, or even none of them. This means that each input pattern can be correctly classified, incorrectly classified, or not classified at all. Table 5 collects this triple classification for the test set and, as can be seen, the expert training is again better than the blind training. The number of patterns correctly classified is almost 40% higher for the expert training. At the same time, expert training leaves less indetermination in the classification (41.0% against 47.5% for the blind training), and the same can be said for the number of patterns incorrectly classified (29.5% against 31.2%).
Table 5. Final classification for the test data set

                         Blind training    Expert training
Correctly classified     13  (21.3%)       18  (29.5%)
Incorrectly classified   19  (31.2%)       18  (29.5%)
Not classified           29  (47.5%)       25  (41.0%)

6 Conclusion and Future Work
This paper presents ongoing work that provides strategies for synthesising qualitative information from variables, each of which is qualitatively described in a different way. It has been shown that using the expert information enhances the network's generalisation. The system is applied in the financial domain to evaluate and simulate credit risk, but the approach may also be applicable to problems in other areas where the variables involved are described in terms of orders of magnitude. The limitations of the method presented cannot be evaluated until the implementation is completed and sufficiently tested. The proposed method is currently being implemented to be applied to available data on the most important American and European firms whose Moody's rating is known. Future tasks include: using the landmarks given by the experts to codify the input variables with orders-of-magnitude labels; using the experts' landmarks to define a qualitative distance in order to build qualitative Gaussian density functions; discovering alternative methods for building a homogenised reference that takes advantage of experts' knowledge; and comparing the results obtained with those furnished by other classifiers used in artificial intelligence.
References

1. Agell, N., Ansotegui, C., Prats, F., Rovira, X., Sánchez, M.: Homogenising References in Orders of Magnitude Spaces: An Application to Credit Risk Prediction. 14th International Workshop on Qualitative Reasoning, Morelia, Mexico (2000)
2. Piera, N.: Current Trends in Qualitative Reasoning and Applications. Monografía CIMNE, 33. International Centre for Numerical Methods in Engineering, Barcelona (1995)
3. Goonatilake, S., Treleaven, P.: Intelligent Systems for Finance and Business. John Wiley & Sons (1996)
4. Poggio, T., Girosi, F.: Networks for Approximation and Learning. Proc. IEEE 78 (1990) 1481–1497
5. Broomhead, D.S., Lowe, D.: Multivariable Functional Interpolation and Adaptive Networks. Complex Systems 2 (1988) 321–355
6. Chen, S., Cowan, C.F.N., Grant, P.M.: Orthogonal Least Squares Learning for Radial Basis Function Networks. IEEE Transactions on Neural Networks 2 (1991) 302–309
7. Prechelt, L.: PROBEN1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94, University of Karlsruhe (1994)
Approximation of Bayesian Discriminant Function by Neural Networks in Terms of Kullback-Leibler Information

Yoshifusa Ito¹ and Cidambi Srinivasan²

¹ Department of Information and Policy Studies, Aichi-Gakuin University, Iwasaki, Nisshin-shi, Aichi-ken 470-0195, Japan
[email protected]
² Department of Statistics, Patterson Office Tower, University of Kentucky, Lexington, Kentucky 40506, USA
[email protected]

Abstract. Following general arguments on the approximation of Bayesian discriminant functions by neural networks, it is rigorously proved that a three-layered neural network with a rather small number of hidden-layer units can approximate the Bayesian discriminant function for two-category classification if the log ratio of the a posteriori probabilities is a polynomial. The accuracy of approximation is measured by the Kullback-Leibler information. An extension to the multicategory case is also discussed.
1 Introduction

We treat the problem of pairwise approximation of the a posteriori probabilities P(ω_i|x) by the outputs F_W(ω_i|x) of a three-layered neural network having d linear units on the input layer and c output-layer units, where x is the feature, W is the weight vector, d is the dimension of the feature space R^d and c is the number of categories ω_1, ..., ω_c. It is known that the a posteriori probabilities can be used as discriminant functions in Bayesian decision theory, and the L² norm and the Kullback-Leibler information have been the two main measures used to evaluate the accuracy of the approximation [2], [3], [5], [6], [7], [8]. In [8], the mean squared difference E_a(W) between the network output and the desired output is decomposed:

$$E_a(W) = \sum_i \int [F_W(\omega_i|x) - P(\omega_i|x)]^2\, p(x)\,dx + \sum_i \int P(\omega_i|x)(1 - P(\omega_i|x))\, p(x)\,dx,$$

where p(x) is the p.d.f. of the feature. Since the second term is independent of W, minimizing E_a(W) implies the respective convergence of the outputs F_W(ω_i|x) toward the a posteriori probabilities P(ω_i|x). This decomposition is used in [2] and [5]. However, as pointed out in [3], there is a disadvantage in using the L² norm in learning. In this paper, we use a cross entropy

$$E(W) = -\sum_i \int p(x, \omega_i) \log F_W(\omega_i|x)\,dx = K(p\|f_W) + I(p),$$

decomposed similarly to E_a(W), where p(x, ω_i) is the probability density on R^d × Ω, Ω = {ω_1, ..., ω_c}, and the first term is the Kullback-Leibler information. The meaning of this information is detailed in [9], and the merits of using it as a cost function are discussed in [3]. In Section 3, we treat the two-category case and prove rigorously that a three-layered neural network with a small number of hidden-layer units can approximate the a posteriori distributions in the sense of the Kullback-Leibler information if the log ratio of the a posteriori distributions is a polynomial of low degree. The extension of the result to the multicategory case is discussed in Section 4. Since there is the restriction $\sum_{i=1}^{c} F_W(\omega_i|x) = 1$, one of the outputs is a linear function of the others. We use nonlinear units for the output and hidden layers. Hence, the outputs are

$$F_W(\omega_i|x) = \begin{cases} \varphi\Big(\sum_{j=1}^{N} a_{ij}\, \phi(w_j \cdot x + t_j) + a_{i0}\Big), & i = 1, \dots, c-1, \\[4pt] 1 - \sum_{i=1}^{c-1} F_W(\omega_i|x), & i = c. \end{cases}$$

Unless F_W(ω_i|x) as well as P(ω_i|x) are positive on the support of p(x), the arguments in this paper are meaningless. We will show later that this assumption cannot be violated during learning.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 135–140, 2001. © Springer-Verlag Berlin Heidelberg 2001
2 Bayesian Decision Theory and Neural Networks
Note that f_W(x, ω) = F_W(ω|x)p(x) is also a probability distribution on R^d × Ω. The accuracy of the approximation of a distribution p(x, ω) by f_W(x, ω) can be measured with the Kullback-Leibler information:

$$K(p\|f_W) = \sum_{i=1}^{c} \int_{R^d} p(x, \omega_i) \log \frac{p(x, \omega_i)}{f_W(x, \omega_i)}\,dx = \int_{R^d} p(x) \sum_{i=1}^{c} P(\omega_i|x) \log \frac{P(\omega_i|x)}{F_W(\omega_i|x)}\,dx. \qquad (1)$$

To compare category-wise the accuracy of the approximation of P(ω_i|x) by F_W(ω_i|x) based on the Kullback-Leibler information with that based on the L² distance, we define

$$K_m(P(\omega_i|\cdot)\,\|\,F_W(\omega_i|\cdot)) = \min\Big\{ \int_{R^d} p(x) \sum_{j=1}^{c} P(\omega_j|x) \log \frac{P(\omega_j|x)}{F_W(\omega_j|x)}\,dx \;\Big|\; F_W(\omega_i|x) \text{ is fixed} \Big\}$$

$$= \int_{R^d} \Big( P(\omega_i|x) \log \frac{P(\omega_i|x)}{F_W(\omega_i|x)} + (1 - P(\omega_i|x)) \log \frac{1 - P(\omega_i|x)}{1 - F_W(\omega_i|x)} \Big)\, p(x)\,dx. \qquad (2)$$

In the case of two-category classification, this coincides with K(p‖f_W). We can prove that K_m bounds the L² norm:

$$\|P(\omega_i|\cdot) - F_W(\omega_i|\cdot)\|_{L^2(R^d, p)} \le \sqrt{\tfrac{1}{2} K_m(P(\omega_i|\cdot)\,\|\,F_W(\omega_i|\cdot))}, \qquad i = 1, \dots, c. \qquad (3)$$
Here, we omit the proofs of (2) and (3). Our cost function is the cross entropy:

$$E(W) = -\int_{R^d} p(x) \sum_{i=1}^{c} P(\omega_i|x) \log F_W(\omega_i|x)\,dx, \qquad (4)$$
which can be asymptotically approximated by

$$E^{(n)}(W) = -\frac{1}{n} \sum_{j=1}^{n} \log F_W(\omega^{(j)}|x^{(j)}), \qquad (5)$$

where $(x^{(j)}, \omega^{(j)})_{j=1}^{\infty}$ is a sequence of independent teacher signals with distribution p(x, ω). If the p.d.f. p(x) is rapidly decreasing and the log F_W(ω_i|x) are bounded by a polynomial, as is the case for most familiar distributions, then E^(n)(W) converges to E(W) almost everywhere by the strong law of large numbers. During learning, each new signal (x^(j), ω^(j)) is incorporated and the sum (5) is simultaneously subjected to gradient descent. Since F_W(ω_i|x) ↓ 0 implies E(W) ↑ ∞, the condition F_W(ω_i|x) > 0 cannot be violated while the gradient descent method is applied. We can decompose E(W):

$$E(W) = K(p\|f_W) + I(p), \qquad (6)$$

where

$$I(p) = -\int_{R^d} p(x) \sum_{i=1}^{c} P(\omega_i|x) \log P(\omega_i|x)\,dx. \qquad (7)$$
This corresponds to the decomposition of E_a(W) by Ruck et al. [8]. The second term I(p) is nothing but the mean entropy of the a posteriori probabilities. Since this term is independent of learning, minimization of E(W) implies minimization of the first term.
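The empirical cost E^(n)(W) of eq. (5) is just the average negative log-likelihood of the teacher signals. A minimal sketch (ours; the toy output function is hypothetical) is:

```python
import math

def empirical_cross_entropy(signals, F):
    """E^(n)(W) = -(1/n) sum_j log F(omega^(j) | x^(j)), eq. (5)."""
    return -sum(math.log(F(omega, x)) for x, omega in signals) / len(signals)

# Toy "network output": a fixed posterior, independent of x (illustration only).
def F(omega, x):
    return 0.8 if omega == 0 else 0.2

signals = [(0.1, 0), (0.4, 0), (0.9, 1)]   # (x, omega) teacher pairs, made up
print(empirical_cross_entropy(signals, F))
```

Note that any signal for which F_W returns 0 would drive E^(n) to infinity, which is why the positivity condition above cannot be violated during gradient descent.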
3 Two Category Classification

A theorem below is proved in a more general form in [4]. The proof of this simplified theorem for n = 2 is described and used in [5].

Theorem 1. Let n ≥ 1, let 1 ≤ p ≤ ∞, let μ be a measure on R^d such that $\|x\|^n \in L^p(R^d, \mu)$, and let φ be an n times continuously differentiable function such that $\phi^{(i)}$, i = 0, ..., n, are bounded and $\phi^{(n)} \not\equiv 0$. Then, for any polynomial Q of degree n and any ε > 0, there are constants a_i, i = 0, ..., N, t_i and vectors V_i in R^d, i = 1, ..., N, for which

$$\bar{Q}(x) = \sum_{i=1}^{N} a_i\, \phi(V_i \cdot x + t_i) + a_0 \qquad (8)$$

satisfies $\|\bar{Q} - Q\|_{L^p(R^d, \mu)} < \varepsilon$.
In particular, if n = 1, N = 1 and, if n = 2, N = d + 1. The N can be rather a small number. Funahashi estimated N to be 2d for n = 2, consuming two φ s to approximate t2 [2]. Actually it can approximated by a linear sum of a single φ, t and a constant, and N = d + 1 is obtained [5]. In the case of twocategory classiﬁcation, the discriminant function can be g(x) = log P (ω1 x)  log P (ω2 x) and any monotone function of g(x) can also be a decision function [1]. If the logistic function σ(t) = (1 + e−t )−1 is used as the monotone function, σ(g(x)) = P (ω1 x) [2]. In Section 2, we have stated that the Km bounds the L2 distance. Here, we prove that the L1 distance of the log likelihood ratio bounds the Km . Proposition 2. Let p(x, ω) and q(x, ω) be p.d.f ’s of mutually continuous probability measures deﬁned on Rd × {ω1 , ..., ωc } such that P (ωi x) = 0, Q(ωi x) = 0 and p(x) = q(x). Then, for each i, we have that Km (P (ωi ·)Q(ωi ·) ≤ log Proof. Set δg(x) = log
P (ωi x) Q(ωi x) − log . 1 − P (ωi x) 1 − Q(ωi x)
Q(ωi x) =
P (ωi x) . P (ωi x) + (1 − P (ωi x))eδg(x)
Then,
Accordingly,
Km (P (ωi ·)Q(ωi ·)) = +
Rd
Q(ωi ·) P (ωi ·) − log 1 d . 1 − P (ωi ·) 1 − Q(ωi ·) L (R ,p)
Rd
(9)
p(x)P (ωi x) log(P (ωi x) + (1 − P (ωi x))eδg(x) )dx
p(x)(1 − P (ωi x)) log(1 − P (ωi x) + P (ωi x)e−δg(x) )dx.
We have that, if δg(x) > 0, 0 < log(P (ωi x) + (1 − P (ωi x))eδg(x) ) < δg(x), and if δg(x) < 0, 0 > log(P (ωi x) + (1 − P (ωi x))eδg(x) ) > δg(x). With similar inequalities for the second integral, we have that δg(x)p(x)dx = δgL1 (Rp ,p) . (10) Km (P (ωi ·)Q(ωi ·)) ≤ Rd
Now we are ready to prove the main theorem.

Theorem 3. Let p(x, ω) be the p.d.f. of a probability distribution on R^d × {ω1, ω2} such that log p(x, ω1)/p(x, ω2) is a polynomial of degree n on the support of p(x) = p(x, ω1) + p(x, ω2), and let φ be an n times continuously differentiable function such that φ^(i), i = 0, ···, n, are bounded and φ^(n)(0) ≠ 0. Suppose that |x|^n ∈ L¹(R^d, p). Then, for any ε > 0, there are constants a_i, i = 0, ···, N, t_i and vectors V_i in R^d, i = 1, ···, N, for which

F_W(ω1|x) = σ( Σ_{i=1}^{N} a_i φ(V_i · x + t_i) + a_0 )  (11)

satisfies

K_m(P(ω1|·)‖F_W(ω1|·)) < ε,  (12)

where σ is the logistic function. In particular, N = 1 if n = 1, and N = d + 1 if n = 2.

Proof. Set Q(x) = log P(ω1|x)/P(ω2|x). Then, by assumption and Theorem 1, there are constants a_i, t_i and vectors V_i in R^d, i = 1, ···, N, for which

‖Q̄ − Q‖_{L¹(R^d, p)} < ε,  (13)

where Q̄(x) = Σ_{i=1}^{N} a_i φ(V_i · x + t_i) + a_0 is the argument of σ in (11). Set F_W(ω1|x) = σ(Q̄(x)); then Q̄(x) = log F_W(ω1|x)/F_W(ω2|x). By Proposition 2,

K_m(P(ω1|·)‖F_W(ω1|·)) ≤ ‖ log P(ω1|·)/P(ω2|·) − log F_W(ω1|·)/F_W(ω2|·) ‖_{L¹(R^d, p)}  (14)

holds. Combining (13) and (14), we obtain (12).

This theorem implies that the a posteriori probability can be approximated, in the sense of the Kullback-Leibler information, by the output of a three-layered neural network having an output unit with activation function σ and a rather small number of hidden layer units.
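As a concrete sanity check of the relation σ(g(x)) = P(ω1|x) used above, the following sketch compares the logistic transform of the discriminant with the Bayes posterior for two unit-variance Gaussian classes. All means and priors here are invented for the illustration; for unit-variance Gaussians the discriminant is a degree-1 polynomial, i.e. the n = 1, N = 1 case of Theorem 3.

```python
import math

def posterior_w1(x, mu, prior):
    # Bayes posterior P(w1 | x) for unit-variance Gaussian class densities.
    def density(m):
        return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)
    p1 = density(mu[0]) * prior[0]
    p2 = density(mu[1]) * prior[1]
    return p1 / (p1 + p2)

def sigma(t):
    # Logistic function sigma(t) = (1 + e^{-t})^{-1}.
    return 1.0 / (1.0 + math.exp(-t))

def g(x, mu, prior):
    # Discriminant g(x) = log P(w1|x) - log P(w2|x); linear in x here.
    return (math.log(prior[0] / prior[1])
            + 0.5 * ((x - mu[1]) ** 2 - (x - mu[0]) ** 2))

mu, prior = (-1.0, 2.0), (0.3, 0.7)   # invented class means and priors
for x in (-2.0, 0.0, 1.5):
    assert abs(sigma(g(x, mu, prior)) - posterior_w1(x, mu, prior)) < 1e-12
print("sigma(g(x)) reproduces P(w1|x)")
```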
4 Multi-category Case and Discussions
The log ratio of the a posteriori probabilities is the log likelihood ratio biased by a constant, and the log likelihood ratios of many probability distributions of the exponential family are polynomials of low degree. In the case of the binomial, polynomial or gamma distribution, the ratio is a linear function in x and, in the case of the normal distribution, it is a quadratic form. Consequently, if the probability distribution is one of these and the neural network has at least d + 1 hidden layer units, its output may approximate the a posteriori probability in the case of two-category classification. It is unnecessary to teach the network the type of the probability distribution beforehand. Conversely, we may obtain information about the probability distribution by observing the output after training.

For simplicity, we now restrict the discussion to the case n = 2. Let N = ½d(d + 1). There are N kinds of quadratic monomials and the same number of linearly independent squares (V_i · x)². Hence, any number of nonhomogeneous polynomials of degree 2 in x can be expressed as a sum

Σ_{i=1}^{N} a_i (V_i · x)² + Σ_{i=1}^{d} b_i x_i + c

simultaneously by adjusting only a_i, b_i, c, if the V_i are appropriately chosen beforehand. A linear sum α_i φ(V_i · x) + Σ_{j=1}^{d} α_{ij} x_j + α_{i0} can approximate (V_i · x)². Hence, a neural network having d linear input units, N + d = ½d(d + 3) hidden layer units and c linear output units can approximate c nonhomogeneous polynomials of degree 2 simultaneously, if the connection weights between the input and hidden layers are fixed appropriately beforehand. Then only the weights between the hidden and output layers need to be adjusted, which may be advantageous.

In the case where the state-conditional p.d.f.'s p(x|ωi) are normal distributions, the pairwise decision functions g_ij(x) = g_i(x) − g_j(x) are quadratic forms and σ(g_ij(x)) = P(ωi|x) (see Funahashi (1998)). Consequently, if the output units are replaced by those having σ as activation function, the outputs approximate P(ωi|x). One way to estimate P(ωi|x) may be to take an average of σ(ḡ_ij(x)), j = 1, ···, c, j ≠ i, where the ḡ_ij are the respective approximations of the g_ij by the network. Once we can approximate P(ωi|x), i = 1, ···, c, it is not difficult to decide the category ωi whose a posteriori probability is the maximum.

In the Kullback-Leibler information (1), the difference between F_W(ωi|·) and P(ωi|·) over R^d × Ω is measured with homogeneous weights with respect to ωi. However, there may be cases where nonhomogeneous weights are preferable in applications. Then a weighted average

Σ_{i=1}^{c} a_i K_m(P(ωi|·)‖F_W(ωi|·))

can be used to estimate the difference.
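The claim above, that N = ½d(d + 1) generic squares (V_i · x)² are linearly independent, can be checked numerically by expanding each square into its coefficients over the monomials x_a x_b (a ≤ b) and computing the rank of the resulting matrix. The random directions V_i below are invented; the rank routine is a plain floating-point Gaussian elimination.

```python
import random
random.seed(7)

d = 3
N = d * (d + 1) // 2   # number of distinct quadratic monomials x_a x_b, a <= b

def square_coeffs(V):
    # Expand (V . x)^2 into coefficients over monomials x_a x_b with a <= b.
    coeffs = []
    for a in range(d):
        for b in range(a, d):
            coeffs.append(V[a] * V[b] * (1 if a == b else 2))
    return coeffs

def rank(rows, tol=1e-9):
    # Plain Gaussian elimination over floats, returning the numerical rank.
    rows = [list(r) for r in rows]
    r = 0
    for c in range(len(rows[0])):
        piv = next((i for i in range(r, len(rows)) if abs(rows[i][c]) > tol), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(len(rows)):
            if i != r and abs(rows[i][c]) > tol:
                f = rows[i][c] / rows[r][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r

# N generic directions V_i give N linearly independent squares (V_i . x)^2.
Vs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
assert rank([square_coeffs(V) for V in Vs]) == N
print("d(d+1)/2 =", N, "independent squares")
```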
References

1. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. John Wiley & Sons, New York (1973)
2. Funahashi, K.: Multilayer neural networks and Bayes decision theory. Neural Networks 11 (1998) 209–213
3. Hinton, G.E.: Connectionist learning procedures. Artificial Intelligence 40 (1989) 185–234
4. Ito, Y.: Simultaneous L^p approximations of polynomials and derivatives on the whole space. Proceedings of ICANN'99 (1999) 587–592
5. Ito, Y., Srinivasan, C.: Bayesian decision theory and three layered neural networks. Proceedings of ESANN'2001 (2001) 377–382
6. Richard, M.D., Lippmann, R.P.: Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation 3 (1991) 461–483
7. Ripley, B.D.: Statistical aspects of neural networks. In: Barndorff-Nielsen, O.E., Jensen, J.L., Kendall, W.S. (eds.): Networks and Chaos - Statistical and Probabilistic Aspects. Chapman & Hall, London (1993) 40–123
8. Ruck, M.D., Rogers, S., Kabrisky, M., Oxley, H., Sutter, B.: The multilayer perceptron as an approximator to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks 1 (1990) 296–298
9. Schervish, M.J.: Theory of Statistics. Springer-Verlag, Berlin, New York (1995)
The Bias-Variance Dilemma of the Monte Carlo Method

Zlochin Mark and Yoram Baram

Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel
{zmark,baram}@cs.technion.ac.il
Abstract. We investigate the setting in which Monte Carlo methods are used and draw a parallel to the formal setting of statistical inference. In particular, we find that Monte Carlo approximation gives rise to a bias-variance dilemma. We show that it is possible to construct a biased approximation scheme with a lower approximation error than a related unbiased algorithm.
1 Introduction
Markov Chain Monte Carlo methods have been gaining popularity in recent years. Their growing spectrum of applications ranges from molecular and quantum physics to optimization and learning. The main idea of this approach is to approximate the desired target distribution by the empirical distribution of a sample generated through the simulation of an ergodic Markov Chain. The common practice is to construct the Markov Chain in such a way as to ensure that the target distribution is invariant. Much effort has been devoted to the development of general methods for the construction of such Markov Chains. However, the fact that approximation accuracy is often lost when invariance is imposed has been largely overlooked. In this paper we make explicit the formal setting of the approximation problem, as well as the required properties of the approximation algorithm. By analogy to statistical inference, we observe that the desired property of an algorithm is a good rate of convergence, rather than unbiasedness. We demonstrate this point with numerical examples.
2 Monte Carlo Method

2.1 The Basic Idea
In various fields we encounter the problem of finding the expectation

â = E[a] = ∫ a(θ) Q(θ) dθ,  (1)

where a(θ) is some parameter-dependent quantity of interest and Q(θ) is the parameter distribution (e.g., in Bayesian learning, a(θ) can be the vector of output values for the test cases and Q(θ) the posterior distribution given the data).

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 141–147, 2001. © Springer-Verlag Berlin Heidelberg 2001
Since an exact calculation is often infeasible, an approximation is made. The basic Monte Carlo estimate of â is

ā_n = (1/n) Σ_{i=1}^{n} a(θ_i),  (2)

where the θ_i are independent and distributed according to Q. In the case where sampling from Q(θ) is impossible, importance sampling [2] can be used:

ā_n = (1/n) Σ_{i=1}^{n} a(θ_i) w(θ_i),  (3)

where the θ_i are distributed according to some P_0(θ) and w(θ_i) = Q(θ_i)/P_0(θ_i).

In many problems, especially high-dimensional ones, independent sampling cannot provide accurate estimates in reasonable time. An alternative is the Markov Chain Monte Carlo (MCMC) approach, where the same estimate as in (2) is used, but the sampled values θ^(i) are not drawn independently. Instead, they are generated by a homogeneous ergodic Markov Chain with invariant distribution Q [5].
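The estimates (2) and (3) can be sketched in a few lines. The target, proposal and sample sizes below are invented for the illustration (they do not come from the paper): the target is Q = N(1, 1), the quantity of interest is a(θ) = θ², so E_Q[a] = 2, and the importance sampling proposal is P_0 = N(0, 2).

```python
import math
import random
random.seed(0)

def mc_estimate(a, sampler, n):
    # Basic Monte Carlo estimate (2): average of a(theta_i) over i.i.d. draws.
    return sum(a(sampler()) for _ in range(n)) / n

def is_estimate(a, sampler0, weight, n):
    # Importance sampling estimate (3): draw from P0, reweight by w = Q/P0.
    total = 0.0
    for _ in range(n):
        th = sampler0()
        total += a(th) * weight(th)
    return total / n

a = lambda th: th * th                       # E_Q[a] = 1^2 + 1 = 2
q_sampler = lambda: random.gauss(1.0, 1.0)   # target Q = N(1, 1)

p0_sampler = lambda: random.gauss(0.0, 2.0)  # proposal P0 = N(0, 2)
def weight(th):
    q = math.exp(-0.5 * (th - 1.0) ** 2) / math.sqrt(2 * math.pi)
    p0 = math.exp(-0.5 * (th / 2.0) ** 2) / (2.0 * math.sqrt(2 * math.pi))
    return q / p0

est_mc = mc_estimate(a, q_sampler, 50000)
est_is = is_estimate(a, p0_sampler, weight, 50000)
print(est_mc, est_is)   # both close to 2
```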
2.2 Approximation Error and the Bias-Variance Dilemma
Let us consider a yet more general approximation scheme where the invariant distribution of the Markov Chain is not Q but some other distribution Q′. Moreover, we may consider nonhomogeneous Markov Chains with transition probabilities T_t(θ^(t+1) | θ^(t)) depending on t. Since ā_n is a random variable, its bias, equilibrium bias and variance are defined as follows:

Bias_n = E{ā_n} − â,
Bias_eq = lim_{n→∞} Bias_n,
Var_n = E{(ā_n − E{ā_n})²},

where E denotes expectation with respect to the distribution of ā_n. In addition, we may define the initialization bias, Bias_n^init = Bias_n − Bias_eq, as a measure of how far the Markov Chain is from equilibrium.

The common practice is to try to construct a homogeneous (i.e., T(·|·) independent of time) reversible Markov Chain with Q(θ) as invariant distribution, hence with the equilibrium bias equal to zero. However, the quality of the approximation is measured not by the bias, but by the average squared estimation error:

Err_n = E[(ā_n − â)²] = Bias_n² + Var_n = (Bias_eq + Bias_n^init)² + Var_n.  (4)

From the last equation it can be seen that the average estimation error has three components. Therefore, there is a possibility (at least potentially) to reduce the total approximation error by balancing those three. Moreover, it shows that, since the initialization bias and the variance depend on the number of iterations, the algorithm may depend on the number of iterations as well.
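A minimal numerical illustration of the decomposition (4), with all constants invented: the sample mean of a Gaussian is unbiased, while a shrunk version trades a nonzero bias for lower variance and, for these parameter values, achieves a smaller total error (theoretically 0.09 + 0.49 · 0.4 ≈ 0.286 versus 0.4).

```python
import random
random.seed(1)

MU, SIGMA, N, TRIALS = 1.0, 2.0, 10, 20000   # invented toy parameters

def estimate(shrink):
    # One trial: (possibly shrunk) sample mean of N Gaussian draws.
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    return shrink * sum(sample) / N

def err(shrink):
    # Empirical Err_n = E[(a_bar - a_hat)^2]; by (4) this is Bias^2 + Var.
    return sum((estimate(shrink) - MU) ** 2 for _ in range(TRIALS)) / TRIALS

unbiased = err(1.0)   # Bias = 0, so Err = Var = SIGMA^2 / N = 0.4
shrunk = err(0.7)     # Bias = -0.3 * MU, but Var shrinks by a factor 0.49
print(unbiased, shrunk)
```

The biased estimator wins here because the squared bias it introduces (0.09) is smaller than the variance it removes, which is exactly the balancing act that (4) suggests.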
Batch Versus Online Estimation. A similar tradeoff between the components of the error appears in statistical inference, where it is known as "the bias-variance dilemma". The analogy to the statistical inference setting can be further extended by making a distinction between two models of MCMC estimation:

Online estimation: the computation time is potentially unlimited, and we are interested in as good an approximation as possible at any point in time or, alternatively, in as rapid a rate of convergence of Err_n to zero as possible.

Batch estimation: the allowed computation time is limited and known in advance, and the objective is to design an algorithm with as low an average estimation error as possible.

Note that the batch estimation paradigm is more realistic than online estimation, since, in practice, the time complexity is of primary importance. In addition, in many instances of Bayesian learning, the prior distribution is not known but is chosen on some ad-hoc basis, and the data are usually noisy. Therefore, it makes little sense to look for a time-expensive high-accuracy approximation to the posterior. Instead, a cheaper rough approximation, capturing the essential characteristics of the posterior, is needed.

The traditional MCMC method overlooks the distinction between the two models. In both cases, the approximation is obtained using a homogeneous (and, usually, reversible) Markov Chain with the desired invariant distribution. It should be clear, however, that such an approach is far from optimal. In the online estimation model, a more reasonable alternative would be to consider a nonhomogeneous Markov Chain that rapidly converges to a rough approximation of Q and whose transition probabilities are modified with time so as to ensure the asymptotic invariance of Q (hence consistency). A general method for designing such Markov Chains is described in the next subsection. In the batch estimation model, the invariance of Q may be sacrificed (i.e., the equilibrium bias may be nonzero) if this facilitates a lower variance and/or initialization bias for the finite computation time.

Mixture Markov Chains. Suppose that we are given two Markov Chains: the first, M1, is unbiased, while the second, M2, is biased but more rapidly mixing. Let their transition probabilities be T_t¹ and T_t² respectively. It will usually be the case that, for small sample sizes, M2 will produce better estimates, but as n grows, the influence of the bias becomes dominant, making M1 preferable. In order to overcome this difficulty, we may define a mixture Markov Chain with mixing probabilities p_t, for which the transition is made according to T_t¹ with probability 1 − p_t and according to T_t² with probability p_t. If the sequence {p_t} starts with p_1 = 1 and decreases to zero, then for small t the resulting Markov Chain behaves similarly to M2, hence producing better estimates than M1, but as t grows, its behavior approaches that of M1 (in particular, the resulting estimate is asymptotically unbiased). If both Markov Chains are degenerate (i.e., produce IID states), it can easily be seen that the bias behaves asymptotically as

Bias_n = O( (1/n) Σ_{i=1}^{n} p_i ),  (5)
144
Zlochin Mark and Yoram Baram
and it can be shown that (5) also holds for any uniformly ergodic Markov Chain. In order to balance the variance, which typically decreases as O(1/n), against the bias, the sequence {p_i} should be chosen so that Bias_n = O(1/√n), i.e. Σ_{i=1}^{n} p_i = O(√n). This mixture Markov Chain approach allows the design of efficient sampling algorithms both in online and in batch settings (in the latter case the sequence {p_i} may depend on the sample size n).
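The schedule argument can be checked numerically in the degenerate (IID) case mentioned above. In this sketch M1 and M2 are plain samplers rather than true Markov Chains, and the bias value and the cutoff schedule (p_t = 1 for t ≤ √n, 0 afterwards, so that Σ p_i = O(√n)) are invented for illustration:

```python
import math
import random
random.seed(2)

BIAS = 0.5   # invented: M2 draws are centered at BIAS instead of the true mean 0

def mixture_estimate(n):
    # Degenerate (IID) mixture chain: step t uses the biased, low-variance
    # sampler M2 while p_t = 1 (t <= sqrt(n)), then the unbiased sampler M1.
    cutoff = math.sqrt(n)
    total = 0.0
    for t in range(1, n + 1):
        if t <= cutoff:
            total += random.gauss(BIAS, 0.1)   # M2: biased, rapidly "mixing"
        else:
            total += random.gauss(0.0, 1.0)    # M1: unbiased
    return total / n

def predicted_bias(n):
    # Eq. (5) for this schedule: (1/n) * sum_i p_i = floor(sqrt(n)) / n, times BIAS.
    return BIAS * math.floor(math.sqrt(n)) / n

for n in (100, 2500):
    mean_est = sum(mixture_estimate(n) for _ in range(1000)) / 1000
    # The empirical bias follows the O((1/n) sum p_i) = O(1/sqrt(n)) prediction.
    assert abs(mean_est - predicted_bias(n)) < 0.05
print("bias decays like 1/sqrt(n), as in (5)")
```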
3 Examples
In this section we present two approaches to designing approximation algorithms with a small controlled bias. Section 3.1 describes a post-processing scheme, based on smoothing techniques, which reduces the variability of the importance weights in the Annealed Importance Sampling algorithm [6]. Section 3.2 compares the well-known Hybrid Monte Carlo algorithm with a simple biased modification and a mixture Markov Chain. In both cases, the empirical comparison shows the superiority of the biased algorithms.

3.1 Independent Sampling
Annealed Importance Sampling. In the Annealed Importance Sampling algorithm [6], the sample points are generated using some variant of simulated annealing, and then the importance weights w^(i) are calculated in a way that ensures asymptotic unbiasedness of the weighted average. The annealed importance sampling (AIS) estimate of E_Q[a] is:

ā_n = Σ_{i=1}^{n} a(x^(i)) w^(i) / Σ_{i=1}^{n} w^(i).  (6)
Smoothed Importance Sampling. While implicitly concentrating on the regions of high probability, the AIS estimate can still have a rather high variance, since its importance weights are random and their dispersion can be large. Let us observe that E_Q[a] may be written as

E_Q[a] = E_P[aw] / E_P[w] = E_P( a E_P(w|a) / E_P[w] ),  (7)

where a = a(x) and P is the distribution defined by the annealing algorithm. The estimate (6) is simply a finite-sample approximation to the first part of (7). Next, the second part of (7) suggests using the estimate

â_n = Σ_{i=1}^{n} a(x^(i)) ŵ(a(x^(i))),  (8)

where ŵ(a) is an estimate of E_P(w|a)/E_P[w], depending on the value of a. It can be shown that, if ŵ(a) were known exactly, then (8) would have a lower variance than (6). In practice, a good estimate ŵ(a) can be obtained using some data smoothing algorithm such as kernel smoothing [3]. The resulting method is henceforth referred to as Smoothed Importance Sampling. The use of smoothing introduces a bias (which can be made arbitrarily small by decreasing the smoothing kernel width to zero as n → ∞), but in return can lead to a significant reduction of the variance and, hence, a smaller estimation error than (6), as demonstrated by the following numerical experiment.

Table 1. Average estimation error of posterior output prediction for different sample sizes. The results for sample size n are based on 10⁴/n trials.

n      Annealed IS   Smoothed IS
10     3.04          1.09
100    0.694         0.313
1000   0.0923        0.0594

A Bayesian Linear Regression Problem. The comparison was carried out using a simple Bayesian learning problem from [6]. The data for this problem consisted of 100 independent cases, each having 10 real-valued predictor variables, x_1, ..., x_10, and a real-valued response variable, y, which is modeled by y = Σ_{k=1}^{10} β_k x_k + ε, where ε is zero-mean Gaussian noise with unknown variance σ². The detailed description of the data-generation model can be found in [6]. The two methods were used to estimate the posterior prediction for a test set of size 100. The "correct" posterior predictions were estimated using Annealed Importance Sampling with sample size 10000. As can be seen in Table 1, the average estimation error of Smoothed Importance Sampling was uniformly lower than that of Annealed Importance Sampling. It should also be noted that the modeling error, i.e. the squared distance between the Bayesian prediction and the correct test values, was 5.08, which is of the same order of magnitude as the estimation error for n = 10. This confirms our claim that, in the context of Bayesian learning, high accuracy approximations are not needed.
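The smoothing step can be sketched as follows, assuming a Nadaraya-Watson kernel estimate for ŵ(a) and a self-normalized final average. The target, proposal and bandwidth below are invented for the illustration and are not the setup of [6]; here the raw weight happens to be a smooth function of a(x) = x², which is the favorable case where smoothing recovers it well.

```python
import math
import random
random.seed(3)

def gauss_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def smoothed_is(n, h=0.3):
    # Draw from the proposal P = N(0, 1.5); the target is Q = N(0, 1),
    # and the quantity of interest is a(x) = x^2 with E_Q[a] = 1.
    xs = [random.gauss(0.0, 1.5) for _ in range(n)]
    avals = [x * x for x in xs]
    w = [gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 1.5) for x in xs]

    def w_hat(ai):
        # Nadaraya-Watson kernel smoothing of the weights against a(x):
        # an estimate of E_P[w | a = a_i], replacing the raw dispersed weight.
        ks = [math.exp(-0.5 * ((ai - aj) / h) ** 2) for aj in avals]
        return sum(k * wj for k, wj in zip(ks, w)) / sum(ks)

    sw = [w_hat(ai) for ai in avals]
    # Self-normalized version of (8), so no separate estimate of E_P[w] is needed.
    return sum(ai * wi for ai, wi in zip(avals, sw)) / sum(sw)

est = smoothed_is(400)
print(est)   # close to E_Q[x^2] = 1
```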
3.2 MCMC Sampling
HMC Algorithm. The Hybrid Monte Carlo (HMC) algorithm [1] is one of the state-of-the-art asymptotically unbiased MCMC algorithms for sampling from complex distributions. The algorithm is expressed in terms of sampling from a canonical distribution defined in terms of an energy function E(q):

P(q) ∝ exp(−E(q)).  (9)

To allow the use of dynamical methods, a "momentum" variable p is introduced, with the same dimensionality as q. The canonical distribution over the "phase space" is defined to be

P(q, p) ∝ exp(−H(q, p)) = exp( −E(q) − Σ_{i=1}^{n} p_i²/2 ),  (10)
where p_i, i = 1, ..., n, are the momentum components. Sampling from the canonical distribution can be done using the stochastic dynamics method, in which simulation of the Hamiltonian dynamics of the system, using leapfrog discretization, is alternated with Gibbs sampling of the momentum. In the Hybrid Monte Carlo method, the bias introduced by the discretization is eliminated by applying the Metropolis algorithm to the candidate states generated by the stochastic dynamics transitions. The candidate state is accepted with probability min(1, exp(∆H)), where ∆H is the difference between the Hamiltonian at the beginning and at the end of the trajectory [5].

A modification of the Gibbs sampling of the momentum, proposed in [4], is to replace p each time by p · cos(θ) + ζ · sin(θ), where θ is a small angle and ζ is distributed according to N(0, I). While keeping the canonical distribution invariant, this scheme, called momentum persistence, allows the use of shorter (hence cheaper) trajectories. However, in order to ensure invariance of the target distribution, the momentum has to be reversed in case the candidate state is rejected by the Metropolis step. This causes occasional backtracking and slows the sampling down.

Stabilized Stochastic Dynamics. As an alternative to the HMC algorithm, we consider stochastic dynamics with momentum persistence but without the Metropolis step. In order to ensure the stability of the algorithm, if the momentum size grows beyond a certain percentile of its distribution (say, 99%), it is replaced using Gibbs sampling. The resulting Stabilized Stochastic Dynamics (SSD) algorithm is biased. However, since it avoids backtracking, it can produce a lower estimation error, as found in the experiments described next.

Bayesian Neural Networks. We compared the performances of SSD, HMC and the mixture Markov Chain with

p_t = { 1, if t ≤ 200√n; 0, otherwise }  (11)

on the following Bayesian learning problem.
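The HMC transition just described can be sketched for a toy one-dimensional energy E(q) = q²/2, so that the canonical distribution (9) is N(0, 1); the step size and trajectory length are invented, and momentum persistence is not included. Dropping the Metropolis test at the end would give the (biased) stochastic dynamics transition that SSD stabilizes.

```python
import math
import random
random.seed(4)

def grad_E(q):
    return q                                 # gradient of E(q) = q^2 / 2

def H(q, p):
    return 0.5 * q * q + 0.5 * p * p         # Hamiltonian of (10), n = 1

def leapfrog(q, p, eps=0.2, steps=10):
    # Leapfrog discretization of the Hamiltonian dynamics.
    p -= 0.5 * eps * grad_E(q)               # initial half step for momentum
    for i in range(steps):
        q += eps * p                         # full step for position
        if i < steps - 1:
            p -= eps * grad_E(q)
    p -= 0.5 * eps * grad_E(q)               # final half step for momentum
    return q, p

def hmc_step(q):
    p = random.gauss(0.0, 1.0)               # Gibbs sampling of the momentum
    q_new, p_new = leapfrog(q, p)
    dH = H(q, p) - H(q_new, p_new)           # Hamiltonian at start minus end
    if random.random() < min(1.0, math.exp(dH)):
        return q_new                         # Metropolis step removes the bias
    return q                                 # rejection (SSD would omit this)

qs = [0.0]
for _ in range(5000):
    qs.append(hmc_step(qs[-1]))
mean = sum(qs) / len(qs)
var = sum((q - mean) ** 2 for q in qs) / len(qs)
print(mean, var)   # near 0 and 1 for the N(0, 1) target
```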
The data consisted of 30 input-output pairs, generated by the following model:

y_i = sin(2.5 x_i) / (2.5 x_i) + ε_i,  i = 1, . . . , 20,  (12)

where the x_i are independently and uniformly generated from [−π, π] and the ε_i are Gaussian zero-mean noise variables with standard deviation 0.1. We used a multilayer perceptron with one input node, one hidden layer containing 5 tanh nodes, and one linear output node. The prior for each weight was taken to be zero-mean Gaussian with inverse variance 0.1. The comparison was carried out using two test statistics: the average log-posterior and the average prediction for 100 equidistant points in [−π, π]. Since the correct values of those statistics are not known, they were estimated using a sample of
Table 2. Average estimation error of mean log-posterior and mean output prediction for different sample sizes.

        Log-posterior            Output prediction
n       SSD    HMC    MIX       SSD      HMC      MIX
1000    3.55   8.52   5.51      0.053    0.102    0.061
3000    0.75   1.34   0.85      0.015    0.031    0.024
10000   0.23   0.43   0.37      0.0049   0.0089   0.0073
size 10⁶ generated by HMC. For each algorithm we performed 100 runs with the initial state generated from the prior. As can be seen from the results in Table 2, both SSD and the mixture Markov Chain produced results which are uniformly better than those of HMC. It may be hypothesized that, for very large sample sizes, HMC will become superior to SSD because of the influence of the bias. However, for these large sample sizes the mixture Markov Chain is expected to produce results very similar to those of HMC. More importantly, the large-sample behavior of the algorithms is not very relevant, as the modeling error, i.e. the squared distance between the Bayesian prediction and the correct test values, was 0.187 in this experiment, which is larger than the estimation error for n = 1000, meaning that, once again, there is no use in making a high-accuracy large-sample approximation.
4 Conclusion
We have shown that in Monte Carlo estimation there is a "bias-variance dilemma", which has been largely overlooked. By means of numerical examples, we have demonstrated that bias-correcting procedures can, in fact, increase the estimation error. As an alternative, approximate (possibly asymptotically biased) estimation algorithms with lower variance and convergence bias should be considered. Once the unbiasedness requirement is removed, a whole range of new possibilities for designing sampling algorithms opens up, such as automatic discretization step selection, different annealing schedules, alternative discretization schemes, etc.
References

1. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195:216–222, 1987.
2. W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996.
3. W. Hardle. Applied Nonparametric Regression. Cambridge University Press, 1990.
4. A. M. Horowitz. A generalized guided Monte Carlo algorithm. Physics Letters B, 268:247–252, 1991.
5. R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.
6. R. M. Neal. Annealed importance sampling. Technical Report No. 9805, Dept. of Statistics, University of Toronto, 1998.
A Markov Chain Monte Carlo Algorithm for the Quadratic Assignment Problem Based on Replicator Equations

Takehiro Nishiyama, Kazuo Tsuchiya, and Katsuyoshi Tsujita

Dept. of Aeronautics and Astronautics, Graduate School of Engineering, Kyoto University, Kyoto, Japan
{nisiyama,tsuchiya,tsujita}@kuaero.kyoto-u.ac.jp
Abstract. This paper proposes an optimization algorithm for the Quadratic Assignment Problem (QAP) based on replicator equations. If the growth rate of a replicator equation is suitably composed of the performance index and the constraints of the QAP, then, by increasing the value of a control parameter in the growth rate, the equilibrium solutions that correspond to the feasible solutions of the QAP become stable in order, starting from the one with the smallest value of the performance index. Based on these characteristics of the system, the following optimization algorithm is constructed: the control parameter is set so that the equilibrium solutions corresponding to the feasible solutions with smaller values of the performance index become stable, and then a Markov chain Monte Carlo algorithm is carried out in the solution space of the replicator equations. The proposed algorithm is applied to many problem instances in the QAPLIB. It is revealed that the algorithm can obtain solutions equivalent to the best known solutions in a short time. In particular, for some large-scale instances, new solutions with the same cost as the best known solutions are obtained.
1 Introduction
The Quadratic Assignment Problem (QAP) [1] is one of the hardest combinatorial optimization problems. The QAP is formulated as the problem of finding an N-dimensional permutation matrix which minimizes a performance index, where N is the size of the problem. Among the approximation methods for the QAP, there are dynamical systems approaches: a dynamical system consisting of the elements of an N × N matrix is constructed, and the mutual interactions between the elements are determined so that the equilibrium solutions of the system become permutation matrices with smaller values of the performance index, i.e. approximate solutions of the QAP. In many studies, the dynamical system is constructed as a gradient system [2,3]: a potential function is composed of the performance index and the constraints, and the dynamical system is constructed as the gradient vector field of the potential function. The system has equilibrium solutions at the minima of the potential function, which correspond to the approximate solutions of the QAP. On the other hand, we have constructed the dynamical

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 148–155, 2001. © Springer-Verlag Berlin Heidelberg 2001
system as a replicator equation [4]. A replicator equation is an equation in which the derivatives of the variables are proportional to the states of the variables; the proportionality coefficients are called growth rates. When the growth rates are suitably determined based on the performance index and the constraints of the QAP, the system has the following characteristics: all feasible solutions of the QAP are equilibrium solutions of the system, and, by increasing the value of a parameter (the control parameter) in the growth rate, the feasible solutions become stable in order, from the one having the smallest value of the performance index to the largest one. This means that, when the control parameter is set suitably, the dynamical system has only solutions with smaller values of the performance index as stable solutions.

In this paper, we propose the following Markov chain Monte Carlo algorithm based on these characteristics of the replicator equations. The growth rates of the replicator equations are designed based on the performance index and the constraints of the QAP, and the control parameter in the growth rate is set so that the equilibrium solutions corresponding to the feasible solutions with smaller values of the performance index become stable. The replicator equations are calculated with some initial values to obtain an equilibrium solution. Then, setting the initial values in some neighborhood of this solution, the replicator equations are calculated again to obtain the next equilibrium solution, which is accepted according to some probability. This procedure is repeated to give a sequence of solutions. The proposed algorithm is applied to some problem instances in the QAPLIB [5] and, in many cases, gives solutions comparable to the best known solutions. In particular, for some large-scale instances, the proposed algorithm gives new solutions having the same values of the performance index as the best known solutions.

Haken et al. have proposed an optimization algorithm based on replicator equations [6], but in their method the above-mentioned characteristics of the equilibrium solutions of the replicator equations are not used explicitly. Ishii and Niitsuma have proposed a dynamical systems approach in which the search space is restricted [7], but their method does not utilize the characteristic, which our method exploits to improve performance, that the search space is composed of good approximate solutions of the QAP.
2 Quadratic Assignment Problem (QAP)
The Quadratic Assignment Problem (QAP) [1] is considered one of the hardest combinatorial optimization problems. Given a set N = {1, 2, ···, N} and N × N matrices A = (a_ij), B = (b_kl), the QAP is defined as follows:

min_{p∈Π_N} L(p),   L(p) = Σ_{i,j} a_ij b_{p(i)p(j)},  (1)

where Π_N is the set of all permutations of N and p is an element of it. Letting Π_{N×N} be the set of all N × N permutation matrices and X = (x_ij) an element of
it, the QAP can also be represented in the following matrix form:

min_{X∈Π_{N×N}} L(X),   L(X) = trace(Aᵀ Xᵀ B X) = Σ_{i,i′,j,j′} a_{jj′} b_{ii′} x_{ij} x_{i′j′}.  (2)
A typical example of the QAP is the facility location problem: consider assigning N facilities to N locations, where a_ij represents the flow of materials from facility i to facility j and b_kl is the distance from location k to location l. The cost of assigning facility i to location k and facility j to location l is a_ij b_kl. The objective of the problem is to find an assignment of all facilities to all locations such that the total cost is minimized.
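The equivalence of forms (1) and (2) is easy to verify numerically. The instance below is a random toy problem, not a QAPLIB instance, and the permutation-matrix convention x_{ij} = 1 iff i = p(j) follows (6) below.

```python
import random
random.seed(5)

N = 5
A = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
B = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]

def cost_perm(p):
    # Form (1): L(p) = sum_{i,j} a_ij * b_{p(i)p(j)}
    return sum(A[i][j] * B[p[i]][p[j]] for i in range(N) for j in range(N))

def cost_matrix(p):
    # Form (2): with x_{ij} = 1 iff i = p(j), the expanded quadruple sum
    # sum_{i,i',j,j'} a_{jj'} b_{ii'} x_{ij} x_{i'j'} equals trace(A^T X^T B X).
    X = [[1 if i == p[j] else 0 for j in range(N)] for i in range(N)]
    total = 0
    for i in range(N):
        for i2 in range(N):
            for j in range(N):
                for j2 in range(N):
                    total += A[j][j2] * B[i][i2] * X[i][j] * X[i2][j2]
    return total

p = list(range(N))
random.shuffle(p)
assert cost_perm(p) == cost_matrix(p)
print("forms (1) and (2) agree:", cost_perm(p))
```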
3 Proposed Dynamical System and Its Characteristics [4]
For the QAP, we have proposed the following replicator equation: u˙ ij = fij (ui j , α0 , α1 )uij , α0 2 ui j + u2ij fij = (1 − u2ij ) − 2 i =i j =j α1 (ajj bii + aj j bi i )u2i j − 2
(3a)
(3b)
i ,j
(i, j = 1, · · · , N ), where fij is called the growth rate, and the parameters are α0 > 0, 0 ≤ α1 1. The ﬁrst term of the growth rate fij leads each u2ij to unity. The second term represents the competition between elements having same subscripts i (j), and the parameter α0 determines the strength of the competition. The third term is derived from the gradient of the performance index: 1 ∂L(U ) = (ajj bii + aj j bi i )u2i j uij , 2 ∂uij
(4)
i ,j
where U = (u_ij²), and suppresses the solutions having larger values of the performance index.

The dynamical system (3) has equilibrium solutions u_ij^(p) (p ∈ Π_N):

u_ij^(p)² = { 1 (i = p(j)), 0 (i ≠ p(j)) }.  (5)

The solutions U^(p) = (u_ij^(p)) are called the feasible solutions, which correspond to the permutation matrices X^(p) = (x_ij^(p)):

x_ij^(p) = { 1 (i = p(j)), 0 (i ≠ p(j)) }.  (6)
[Fig. 1. Distribution of the feasible solutions for (a) α0 = 1.01 and (b) α0 = 3.0. Horizontal axis: relative difference 100(L − Lopt)/Lopt (%); vertical axis: number of solutions (Lopt is the optimal value of L).]
The stability condition for a feasible solution U^(p) corresponding to a permutation matrix X^(p) is approximately given as follows:

α0 > 1 + (α1 / (N − 1)) (L(X^(p)) − L̄),   L̄: constant.  (7)

The condition (7) indicates that, when α0 is close to 1, only feasible solutions having smaller values of the performance index L, i.e. good approximate solutions of the QAP, are stable. To verify the stability condition (7), the dynamical system (3) was computed with many sets of random initial values using the problem instance "Nug20" (N = 20) from the QAPLIB [5]. Figures 1 (a), (b) show the results for α0 = 1.01 and α0 = 3.0 respectively. In the case of α0 = 1.01, only good solutions are obtained as compared with the case of α0 = 3.0.
4 Optimization Algorithm
Based on the above analysis, we propose an optimization algorithm for the QAP using the Markov chain Monte Carlo algorithm.
1. M := size of the neighborhood
2. T0 := initial temperature
3. X^(p0) := initial permutation matrix
4. n := 0 (iteration)
5. α0 := 1 + ε (ε ≪ 1)
6. while n < nmax do
   6.1. Choose M rows and columns randomly in the matrix U
   6.2. Give random initial values to the corresponding M × M elements u_ij, where the rest of the elements are fixed to the values of U^(pn)
   6.3. Calculate (3) to obtain the equilibrium solution U^(pn+1) and the corresponding permutation matrix X^(pn+1). If the solution is not feasible, go back to Step 6.1
   6.4. Apply the 2-opt method to X^(pn+1)
   6.5. Accept X^(pn+1) with probability e^{−[L(X^(pn+1)) − L(X^(pn))]_+ / Tn}, where [a]_+ = max{a, 0}. If rejected, X^(pn+1) := X^(pn), U^(pn+1) := U^(pn)
   6.6. Tn+1 = b · Tn (b < 1)
   6.7. n := n + 1

Fig. 2. Algorithm
First, the search space is constructed as follows: the replicator equation (3) is derived according to the given QAP, and the control parameter α0 is set close to 1 so that (3) has stable equilibrium solutions U^(p) corresponding to permutation matrices X^(p) with smaller values of the performance index of the QAP. The search space is composed of these stable equilibrium solutions U^(p).

Next, given a solution U^(pn) corresponding to the permutation matrix X^(pn), a new solution U^(pn+1) is searched for in the 'M-neighborhood' of the current solution as follows: M rows and columns are randomly chosen in the matrix U of (3), random values are given to the corresponding elements u_ij as initial values, and then (3) is calculated, with the rest of the elements fixed to the values of U^(pn). This yields the new solution U^(pn+1) and the corresponding permutation matrix X^(pn+1). A local search method, the 2-opt method, is then applied to the obtained solution; this simple heuristic slightly modifies the solution so that it becomes the local minimum in the 2-neighborhood of the solution.

The new solution X^(pn+1) is accepted or rejected based on the Metropolis method [8]: if the performance index value is decreased by the change of the solution, the new solution is accepted, and if the value is increased by an amount ∆L, the new solution is accepted with probability exp(−∆L/T). The parameter T, called the temperature, is decreased at every step of the algorithm by multiplying it by a constant b (< 1). The whole procedure of the proposed algorithm is shown in Fig. 2.

This algorithm is not a true simulated annealing [9] in two respects. First, since the new solution is searched for in the M-neighborhood of the current solution among the stable equilibrium solutions of (3), ergodicity does not always hold.
Second, since the obtained solution is modified by the 2-opt method in each step of the algorithm, the detailed balance is not satisfied. Therefore, the convergence properties of the proposed algorithm are not guaranteed. However, the numerical experiments described below show that the algorithm can obtain good solutions for many instances in the QAPLIB.

A Markov Chain Monte Carlo Algorithm for the QAP

Table 1. Solutions of the proposed algorithm (%)

M     Mean     Standard deviation   Minimum
5     0.014    0.022                0
10    0.0021   0.0019               0
15    0.015    0.027                0

Table 2. Solutions of the proposed algorithm with dynamics and without dynamics (random) (%)

            Mean     Standard deviation   Minimum
Dynamics    0.0021   0.0019               0
Random      0.015    0.017                0
5
Numerical Experiments
Numerical experiments were carried out using some large scale problem instances from the QAPLIB. First, the effect of the size M of the neighborhood was checked using the instance "Wil100" (N = 100). The parameters are α0 = 1.01, α1 = 0.003, b = 0.99995, nmax = 50000, and T0 = 300.0. The resulting means, standard deviations, and minima of ten separate trials for each of M = 5, 10, 15 are shown in Table 1. These values are the relative differences 100(L − Lopt)/Lopt (%) from the best known solution Lopt. The best result was obtained for M = 10. A likely reason is that, if M is too large, good parts of the solution may be destroyed when a new solution is generated; the optimum value of M requires further study. Next, we compared the proposed method with a method in which the new solution is randomly generated in the neighborhood of the current solution. The computation was carried out using the problem instance "Wil100" with M = 10. The results are shown in Table 2. The proposed algorithm gives better performance on average. This indicates that the proposed method searches effectively in the relatively large neighborhood using the dynamical system. Finally, the performance of the proposed algorithm was verified using some large scale problem instances from the QAPLIB, i.e. "Sko100a", "Sko100b", "Sko100f", "Wil100", "Tai100a" (N = 100) and "Tho150" (N = 150). The results are shown in Table 3, where Lopt is the best known solution given in the QAPLIB, together with (in parentheses) the name of the algorithm that gave it: Genetic Hybrids (GEN), Reactive Tabu Search (ReTS), or Simulated Jumping (SIMJ).

Takehiro Nishiyama, Kazuo Tsuchiya, and Katsuyoshi Tsujita

Table 3. Performance

Name      N     Lopt (algorithm)    L          Difference (%)
Sko100a   100   152002 (GEN)        152002     0
Sko100b   100   153890 (GEN)        153890     0
Sko100f   100   149036 (GEN)        149036     0
Tai100a   100   21125314 (ReTS)     21146176   0.099
Wil100    100   273038 (GEN)        273038     0
Tho150    150   8133484 (SIMJ)      8135474    0.024

L is the solution obtained by the proposed algorithm, and the last column shows the relative difference from Lopt. Solutions having the same values of the performance index as the best known solutions were obtained for four of the six problem instances.¹ In the proposed method, since the size of the neighborhood is set to M (∼ 10), the dynamical system with only M × M elements is calculated even if the size of the problem is N (∼ 100). Therefore the computation time for one step of the algorithm is very short. On a COMPAQ AlphaStation XP900 computer, the total computation time for 50000 steps of the algorithm was only about 1–2 hours.
6
Conclusion
In this paper, we proposed a Markov chain Monte Carlo algorithm based on the replicator equations. In the proposed dynamical system, only good approximate solutions are obtained by appropriately setting a control parameter of the system. Therefore, using the system, good solutions can be searched efficiently in a relatively large neighborhood in each step of the algorithm. The proposed algorithm was applied to some large scale benchmark problems of the QAP. It was shown that the algorithm can obtain solutions equivalent to the best known solutions in a short time. In particular, solutions with the same performance index values as the best known solutions were obtained for some problem instances.
References

1. Pardalos, P. M., Rendl, F. and Wolkowicz, H.: The Quadratic Assignment Problem: A Survey and Recent Developments. In: Pardalos, P. and Wolkowicz, H. (eds.): Quadratic Assignment and Related Problems. Vol. 16 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society (1994) 1–42

¹ The obtained permutation matrices were different from the ones given in the QAPLIB.
2. Hopfield, J. J. and Tank, D. W.: "Neural" Computation of Decisions in Optimization Problems. Biological Cybernetics 52 (1985) 141–152
3. Ishii, S. and Sato, M.: Constrained Neural Approaches to Quadratic Assignment Problems. Neural Networks 11 (1998) 1073–1082
4. Tsuchiya, K., Nishiyama, T. and Tsujita, K.: A Deterministic Annealing Algorithm for a Combinatorial Optimization Problem Using Replicator Equations. Physica D 149 (2001) 161–173
5. Burkard, R. E., Karisch, S. and Rendl, F.: QAPLIB – A Quadratic Assignment Problem Library. Journal of Global Optimization 10 (1997) 391–403
6. Haken, H., Schanz, M. and Starke, J.: Treatment of Combinatorial Optimization Problems Using Selection Equations with Cost Terms. Part I: Two-Dimensional Assignment Problems. Physica D 134 (1999) 227–241
7. Ishii, S. and Niitsuma, H.: λ-opt Neural Approaches to Quadratic Assignment Problems. Neural Computation 12 (2000) 2209–2225
8. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E.: Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21 (1953) 1087–1092
9. Kirkpatrick, S., Gelatt, C. D., Jr. and Vecchi, M. P.: Optimization by Simulated Annealing. Science 220 (1983) 671–680
Mapping Correlation Matrix Memory Applications onto a Beowulf Cluster

Michael Weeks, Jim Austin, Anthony Moulds, Aaron Turner, Zygmunt Ulanowski, and Julian Young

Advanced Computer Architecture Group, Computer Science Department, University of York, Heslington, York, UK
{mweeks,austin}@cs.york.ac.uk

Abstract. The aim of the research reported in this paper was to assess the scalability of a binary Correlation Matrix Memory (CMM) based on the PRESENCE (PaRallEl StructurEd Neural Computing Engine) architecture. A single PRESENCE card has a finite memory capacity, and this paper describes how multiple PCI-based PRESENCE cards are utilised in order to scale up memory capacity and performance. A Beowulf class cluster, called Cortex-1, provides the scalable I/O capacity needed for multiple cards, and techniques for mapping applications onto the system are described. The main aims of the work are to prove the scalability of the AURA architecture, and to demonstrate the capabilities of the architecture for commercial pattern matching problems.
1
Introduction
This paper investigates methods to distribute pattern-matching tasks over a number of parallel CMMs, implemented as hardware PRESENCE cards within a Beowulf cluster. In theory, the performance of the hardware CMM should scale with the number of PRESENCE cards in use. However, this project will provide results on the effects of software, device driver, and communications overheads on performance. It must be noted that the distributed CMM project is still in progress, with results still forthcoming. The paper first introduces the AURA architecture and the history behind it. A brief overview of the PRESENCE PCI card is then given, followed later by techniques for mapping large CMM applications over multiple PRESENCE cards in the Cortex-1 cluster. Next, the specification of Cortex-1 is given, as well as the reasons for choosing a Beowulf class cluster. The AURA library is then discussed, followed by the seamless method by which the distributed CMM is implemented over the cluster. The paper closes with conclusions from the work.
2
AURA
AURA (Advanced Uncertain Reasoning Architecture) is a generic family of techniques and implementations intended for high-speed approximate search and match operations on large unstructured datasets [4].

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 156–163, 2001.
© Springer-Verlag Berlin Heidelberg 2001

AURA technology is fast, economical, and offers unique advantages for finding near-matches that are not available with other methods. AURA is based on a high-performance binary neural network called the Correlation Matrix Memory (CMM). Typically, several CMM elements are used in combination to solve soft or fuzzy pattern-matching problems. AURA takes large volumes of data and constructs a special type of compressed index. AURA finds exact and near-matches between indexed records and a given query, where the query itself may have omissions and errors. The degree of nearness required during matching can be varied through thresholding techniques. AURA implicitly supports powerful combinatorial queries, which accept a match between, for example, any 5 from 10 fields in the query against the stored records. A degree of index compression is preset, allowing a trade-off between storage efficiency and accuracy of recall. In practice, this means that AURA is guaranteed to find all genuine matches but will typically find additional false matches, depending on the degree of index compression used. The false matches are easily detected in the relatively small result set by conventional (but computationally slow) matching techniques. The increasing range of applications for AURA includes:
– postal address matching;
– high-speed rule-matching systems [5];
– high-speed classifiers (e.g. novel k-NN implementations) [9];
– structure matching (e.g. 3D molecular structures) [8];
– trademark-database searching [1].
Other applications under development include data analysis and case-based reasoning systems.
3
PRESENCE Hardware
This section gives a brief overview of the operation of the PRESENCE architecture. A more detailed discussion was presented at MicroNeuro'99 [7]. PRESENCE is the current family of hardware designs which accelerate the core CMM computations needed in AURA applications. PRESENCE incorporates several improvements over the first hardware prototype:
– 5-stage pipelined Sum-and-Threshold (SAT) processor;
– 128 concurrent bit-summations;
– 50 nanosecond summing cycle;
– 128 MByte DRAM memory.
The PRESENCE card utilises the PCI-bus interface, to allow it to be used in standard desktop computer systems and to give the CPU fast access to the card. The card is a PCI slave device, though future versions will include PCI bus-mastering capability in order to reduce CPU processing.
Fig. 1. An example of an inverted index teach: the input pattern indexes the keywords (aardvark, bat, cow, dog, elephant, fox, goldfish) and the output pattern indexes the documents. For document 5, whose keywords are bat, elephant and goldfish, the input pattern is 0 1 0 0 1 0 1 and the output pattern is all zeros except for the bit representing document 5.
The PRESENCE architecture consists basically of a binary correlation matrix neural network implemented in memory, for the storage and retrieval of vector patterns. Each column of the matrix is seen as a neuron, and each row represents an input and its synapses to each neuron. In common with the CMM concept, the PRESENCE card has two main modes of operation: teach and recall. In teach mode, the input and output binary pattern vectors are supplied to the card over the PCI bus. Recall is achieved by issuing only the input binary vector; on completion, the resulting summed column data of the CMM can be read unprocessed from the card for post-processing, or hardware thresholding can be applied to the data to sort out the best matches. Two types of hardware thresholding can be applied (Willshaw or L-max), depending upon the application. Willshaw [10] thresholding compares the summed columns with a threshold level, whilst L-max [11] retrieves the top L matches from all of the summed columns. A detailed description of the operation of CMM neural networks can be found in [12]. CMM techniques lend themselves to applications such as inverted indexes, whereby objects stored in the CMM are categorised by certain attributes. An example is a keyword search of documents, where the output pattern associates one or more bits with a list of documents, and the input pattern describes keywords in the documents. Figure 1 illustrates how documents and keywords are taught into a CMM. A recall operation applies selected keywords as the input pattern, and the output pattern contains the matching documents.
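The teach/recall cycle described above can be illustrated in a few lines of Python (an illustrative software model, not the AURA C++ library's API):

```python
class CMM:
    """Binary correlation matrix memory: rows index input bits, columns output bits."""
    def __init__(self, n_in, n_out):
        self.w = [[0] * n_out for _ in range(n_in)]

    def teach(self, x, y):
        # Hebbian OR-update: set w[i][j] wherever input bit i and output bit j are both 1
        for i, xi in enumerate(x):
            if xi:
                for j, yj in enumerate(y):
                    if yj:
                        self.w[i][j] = 1

    def recall(self, x, threshold=None):
        # Sum each column over the set input bits, then apply a Willshaw threshold
        sums = [sum(self.w[i][j] for i, xi in enumerate(x) if xi)
                for j in range(len(self.w[0]))]
        if threshold is None:
            threshold = sum(x)   # require every presented keyword to match
        return [1 if s >= threshold else 0 for s in sums]

# Inverted-index example of Fig. 1: 7 keywords, 13 documents
cmm = CMM(7, 13)
doc5_keywords = [0, 1, 0, 0, 1, 0, 1]    # bat, elephant, goldfish
doc5_pattern = [0] * 13
doc5_pattern[4] = 1                      # the bit for document 5
cmm.teach(doc5_keywords, doc5_pattern)
matches = cmm.recall(doc5_keywords)      # recovers the document-5 bit
```

Presenting a subset of the keywords with a lower threshold yields the near-match behaviour described above.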
4
Cortex-1 Beowulf Cluster
The number of PRESENCE cards that can be used in a single PC is limited to five slots, due to restrictions imposed by the PCI-bus standard. PCI-to-PCI bridges allow secondary slots to be created (at the loss of one primary slot), but are impractical due to added latencies and the physical dimensions of the resulting system. In order to gain scalable I/O, the PRESENCE cards are distributed across the nodes of a Beowulf cluster (Cortex-1). Communication between the cluster nodes is via sockets. Cortex-1 is an eight-node Linux Beowulf cluster connected via fast Ethernet; four nodes of the cluster are also connected via SCI (Scalable Coherent Interface) links, though these are not utilised as yet. Each node in the cluster has a 500 MHz Pentium III processor with 384 MByte of system memory. Using five PRESENCE cards per node, we can utilise 30 cards in six nodes, allowing a maximum possible CMM size of 3.84 GByte.
5
The AURA Library
The object-oriented AURA library, written in C++, is a collection of objects for creating and accessing CMMs, in addition to pre- and post-processing algorithms. The library has been written for several machine-OS-compiler combinations, though we are initially concerned with Linux and gcc for this project. AURA CMMs can be instantiated so that they map onto the PRESENCE hardware, or are simulated in software. Simulated CMM classes were initially designed to investigate the CMM concept, to experiment with new features, and to enable application development whilst the hardware progressed. The hardware PRESENCE cards in a node are encapsulated inside a NodeCMM class. This is based upon a low-level hardware driver (/dev/presdrv) that has been developed for insertion into the Linux kernel as a module. A static (hw_ops) library enables the NodeCMM class to access the driver via the ioctl() system routine. Currently the low-level driver makes use of polling, and as such all operations are blocking. An interrupt-driven version of the driver is under development to make the driver non-blocking. A simulated CMM object's storage capacity scales according to the amount of usable system memory, but the NodeCMM is limited to a maximum of 640 MByte with five PRESENCE cards. A DistributedCMM class has been designed that can seamlessly replace a simulated CMM object. This class encapsulates the underlying cluster infrastructure and each node's NodeCMMs, so that they are hidden from the application programmer. In this way, converting existing applications to distributed PRESENCE cards should simply be a matter of redeclaring the CMM objects and recompiling. Two methods were considered for the specification of the DistributedCMM, depending upon the level of abstraction handled by the master node.
One level of abstraction allows the master to know everything about the slave nodes, basically addressing individual boards within the slave nodes, which operate in an almost transparent manner. The second option appears to be the more flexible and scalable design for the distributed CMM and is shown in Figure 2. Here there are two levels of resource management, increasing in abstraction towards the master. The lowest level is the Slave Process (SP), which manages the slave node's resources and supplies abstract information about its overall "CMM capacity". In this way a node's resources are encapsulated, allowing the flexibility for the Slave Process to decide, for example, to change where the CMM resides, without the DistributedCMM needing to know this information.

Fig. 2. DistributedCMM object mapped onto the Cortex-1 cluster (the application's DistributedCMM object on the master node communicates via sockets with the Master Process Manager and directly with the Slave Processes, which encapsulate each slave node's NodeCMM class and simulated CMM)

Information on the nodes' capabilities is then passed to the Master Management Process (MMP), which resides centrally on the master node. The MMP handles all requests for resource information and CMM creation. However, requests for recalls are handled via direct connections from the DistributedCMM to the SPs, in order to avoid a bottleneck at the MMP and to enhance concurrency. This scheme hides the complexity of the communications inside the DistributedCMM object whilst maximising communication efficiency. The latter is especially important as the number of nodes is scaled up. With spare processing capacity available on each slave node, we can further enhance the performance of the cluster by implementing a CMM in software. With sufficient memory resources, a software CMM is of comparable performance to a single PRESENCE card. This hybrid approach only becomes possible when the driver supports interrupts.
6
NodeCMM Class Implementation
The first step in implementation is to determine how a CMM is mapped onto a single card. The memory on a single PRESENCE card is organised as 8M words, where each word is 128 bits wide. Figure 3 illustrates how a CMM with an output pattern greater than 128 bits maps into this memory. Obviously, increasing the output pattern width to k · 128 decreases the input pattern width by a factor of k (to 8M/k locations). Whilst it is possible to map a CMM that requires less than 128 MByte onto a single card, it is preferable to allocate a CMM over multiple cards by striping its
output vector.

Fig. 3. An example of a CMM mapping to weights memory in a single PRESENCE card (the output pattern is split into 128-bit wide stripes which are stored one after another in the 8M-location weights memory and processed by the 128-bit summer)

Firstly, the available memory scales with the number of cards utilised, so five PCI PRESENCE cards in a PC crate allow 640 MByte. Secondly, the time taken for a recall decreases if its output pattern is allocated across multiple cards. Figure 4 illustrates how the performance of a CMM recall increases when striping over multiple cards is employed. Note that the speedup does not scale linearly. This is due to a PCI bottleneck when collecting the data from the PRESENCE cards, plus the additional post-processing that is required to reconstitute the partial output patterns into the full CMM output pattern. Whilst the output pattern is scattered over the PRESENCE cards, the input pattern is broadcast to all cards, as it is unlikely that the number of index terms used will exceed the 8 million possible on a PRESENCE card. If an input vector were split across multiple cards, their individual summed columns would be meaningless upon recall. To apply thresholding to the result of such a CMM recall, the node would have to gather all raw data results, summing the appropriate column subtotals before thresholding. Obviously, the large quantity of data retrieved and the post-processing required make this an inefficient method, and it is therefore not implemented. Due to the natural scalability of Willshaw-thresholded recalls, summed-column thresholding can remain in hardware. L-max thresholded recalls, however, require the L highest summed columns over the whole output pattern. These highest matches may all be on one card or, more likely, spread across multiple cards of the CMM. Since the thresholding engine on each card does not provide the raw column count values, L-max thresholding cannot be scaled directly across multiple cards. The alternative approach is to perform Willshaw thresholding repeatedly at various levels, until there are L columns in the retrieved set.

Fig. 4. Time taken for a 30 index-term recall on various size CMMs (32k × 32k up to 32k × 160k), when distributed across one to six cards (recall time in µs)
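The striping and thresholding scheme above can be modelled in a short sketch (hypothetical illustration; the function names and data layout are assumptions, not the NodeCMM interface):

```python
def card_sums(card, x):
    # Partial summed columns for one card's stripe (card: rows = inputs, cols = stripe)
    return [sum(card[i][j] for i in range(len(x)) if x[i])
            for j in range(len(card[0]))]

def willshaw_recall(cards, x, threshold):
    # Broadcast the input to every card and concatenate partial results in stripe order
    sums = [s for card in cards for s in card_sums(card, x)]
    return [1 if s >= threshold else 0 for s in sums]

def lmax_recall(cards, x, l):
    # Emulate L-max by repeating Willshaw thresholding at decreasing levels
    # until at least l columns fire across all cards
    for t in range(sum(x), 0, -1):
        hits = [(ci, j) for ci, card in enumerate(cards)
                for j, s in enumerate(card_sums(card, x)) if s >= t]
        if len(hits) >= l:
            return hits
    return []

# Two cards, each holding a 2-bit stripe of a 4-bit output pattern
cards = [[[1, 0], [1, 0], [0, 1], [0, 0]],
         [[0, 0], [1, 0], [0, 1], [0, 1]]]
x = [1, 1, 0, 0]
full = willshaw_recall(cards, x, 2)      # [1, 0, 0, 0]
top = lmax_recall(cards, x, 1)           # [(0, 0)]: card 0, column 0
```

Each card only ever sees its own stripe, mirroring why a true L-max over the whole pattern cannot be computed locally.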
7
Test Application
The application used for test purposes is a simple inverted index, whereby objects stored in the CMM are categorised by certain attributes. The inverted index application used is a keyword search of documents. Other, more practical applications have been implemented using the AURA library's simulated CMMs, and it is planned that these will be adapted for later use with the cluster. Such applications include address matching, company trademark database matching, and molecular matching. Performance tests on the NodeCMM class have returned a time of 5.4 ms for a 30 index-term Willshaw recall on a CMM of input size 32768 bits and output size 163840 bits. The CMM was equally distributed over five cards, using a striped output pattern. When used in the document keyword search application, this corresponds to an effective search rate of 30.34 million documents per second per node. Applying the results attained for the NodeCMM class, the DistributedCMM performance should theoretically scale with the number of nodes. However, there will be some degradation due to communications and post-processing overheads.
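The quoted search rate follows directly from the recall time and the number of output-pattern bits (one per indexed document):

```python
documents = 163840        # output-pattern bits, one per indexed document
recall_time = 5.4e-3      # seconds per 30 index-term Willshaw recall
rate = documents / recall_time
print(f"{rate / 1e6:.2f} million documents per second per node")  # 30.34
```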
8
Conclusion
The techniques discussed in this paper are intended to scale up CMM performance and storage capacity when used with the PRESENCE architecture. The paper provides performance data for a node-level CMM, discusses how the cluster is used to provide scalable CMMs, and shows how the spare processing capacity at each node can be utilised to provide a hybrid hardware/software distributed CMM. This is wrapped in an AURA library class to allow seamless incorporation into current AURA applications.
References

1. Sujeewa Alwis and Jim Austin. A novel architecture for trademark image retrieval systems. In Electronic Workshops in Computing. Springer, 1998.
2. Jim Austin. ADAM: A distributed associative memory for scene analysis. In First Int. Conference on Neural Networks, volume IV, page 285, San Diego, June 1987.
3. Jim Austin and John Kennedy. The design of a dedicated ADAM processor. In IEE Conference on Image Processing and its Applications, 1995.
4. Jim Austin, John Kennedy, and Ken Lees. The advanced uncertain reasoning architecture. In Weightless Neural Network Workshop, 1995.
5. Jim Austin, John Kennedy, and Ken Lees. A neural architecture for fast rule matching. In Artificial Neural Networks and Expert Systems Conference (ANNES '95), Dunedin, New Zealand, December 1995.
6. Anthony Moulds. Evaluation of multiple PRESENCE hardware in large systems. Technical report, ACA Group, Computer Science, University of York, 2000.
7. Anthony Moulds, Richard Pack, Zygmunt Ulanowski, and Jim Austin. A high performance binary neural processor for PCI and VME bus-based systems. In Weightless Neural Networks Workshop, 1999.
8. Aaron Turner and Jim Austin. Performance evaluation of a fast chemical structure matching method using distributed neural relaxation. In Fourth International Conference on Knowledge-Based Intelligent Engineering Systems, August 2000.
9. P. Zhou and J. Austin. A PCI bus based correlation matrix memory and its application to k-NN classification. In MicroNeuro'99, Granada, Spain, April 1999.
10. D. J. Willshaw, O. P. Buneman, and H. C. Longuet-Higgins. Non-holographic associative memory. Nature, 222:960–962, 1969.
11. D. P. Casasent and B. A. Telfer. High capacity pattern recognition associative processors. Neural Networks, 5(4):251–261, 1992.
12. Jim Austin. Distributive associative memories for high speed symbolic reasoning. Int. J. Fuzzy Sets and Systems, 82:223–233, 1996.
The research detailed in this paper was funded by EPSRC grant numbers GR/L74651 and GR/K41090.
Accelerating RBF Network Simulation by Using Multimedia Extensions of Modern Microprocessors

Alfred Strey and Martin Bange

Department of Neural Information Processing, University of Ulm, D-89069 Ulm, Germany
[email protected] Abstract. All modern microprocessors oﬀer multimedia extensions that can accelerate many applications based on vector and matrix operations. In this paper the suitability of such units for the fast simulation of RBF networks is analyzed. It is shown that the reduced arithmetic precision is suﬃcient for recognition and training if rounding is supported. An experimental performance study revealed a high speedup in the range from 2 to 10 compared to sequential implementations.
1
Motivation
Current microprocessors operate at high clock frequencies of about 1 GHz and achieve satisfactory performance for most neural network applications. However, for the simulation of neural networks embedded in real-time systems, the power of a single microprocessor is often still insufficient when complex pattern recognition tasks must be executed at high speed. This lack of performance becomes far more evident if the training must also be performed online, to adapt the neural network parameters to the current environment in real time. Many modern microprocessors contain a special data-parallel execution unit, called either a multimedia, vector, or SIMD (Single Instruction Multiple Data) unit. It allows the simultaneous execution of arithmetic operations on several short data elements packed in 64-bit or 128-bit registers. The user must explicitly program the parallel execution in the code by using special SIMD instructions. Although the architecture and the instruction sets of most multimedia units are mainly designed to accelerate popular multimedia algorithms, many neural network operations can also easily be mapped onto SIMD units. However, no detailed analysis of the performance gain that can be achieved for neural network applications has been published so far. Only Gaborit et al. have shown that the calculation of the Mahalanobis distance in a generalized RBF network can be accelerated by a factor of 3 on Intel's MMX unit [2]. In this paper, an RBF network trained by gradient descent is used as a typical benchmark application. The following section summarizes the RBF algorithm and explains the main differences between five SIMD units available in current microprocessors. Section 3 discusses whether the arithmetic precision offered by multimedia units is sufficient for the simulation of RBF networks. Section 4 presents the results of an experimental performance study, and Section 5 concludes this paper.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 164–169, 2001.
© Springer-Verlag Berlin Heidelberg 2001
2
Approach
The RBF network represents a typical artificial neural network model suitable for many approximation or classification tasks. To achieve good results, it requires a proper initialization of all prototypes c_ij and of the widths σ_j of the Gaussian radial basis functions in all RBF neurons. After initialization, all network parameters can be adapted by a gradient-descent training algorithm:

x_j = Σ_{i=1}^{n} (u_i − c_{ij})²                   (1)
y_j = e^{−x_j/(2σ_j²)} = e^{−x_j s_j}               (2)
z_k = Σ_{j=1}^{h} y_j w_{jk},   δ_k = t_k − z_k     (3)
δ_j = Σ_{k=1}^{m} δ_k w_{jk}                        (4)
s_j := s_j − η_s x_j y_j δ_j                        (5)
w_{jk} := w_{jk} + η_w y_j δ_k                      (6)
c_{ij} := c_{ij} + η_c (u_i − c_{ij}) δ_j y_j s_j   (7)

Eq. 1 to 7 represent the basic neural operations that will be analyzed throughout this paper. They are also relevant for many other neural network models. The calculation of the exponential function cannot be accelerated by the multimedia units. In all formulas s_j = 1/(2σ_j²) is used instead of the standard deviation σ_j, to eliminate divisions that cannot be realized efficiently on most multimedia units. This modification also improves the numeric stability of the algorithm and is essential for the implementation on architectures with limited precision. Five different multimedia units have been selected for this experimental study (see Table 1): Intel's MMX [4] (also available in current AMD processors) and Sun's VIS [7] allow only operations on integer data. Intel's SSE [6] and AMD's 3DNow! [3] support only the 32-bit floating-point data format, but also provide a few additional integer instructions for improving MMX. Motorola's AltiVec [1] operates on integer and floating-point data. The parallelism degree p varies in the range from 2 to 16, depending on the selected data size and the available register width. Table 1 lists all SIMD instructions required for the implementation of Eq. 1 to 7. Each instruction operates simultaneously on p corresponding data elements stored in two registers. Integer multiplications, especially, must be realized differently on the different units.
A general 16 × 16 → 32 bit multiplication is available only on AltiVec. All other SIMD units require a sequence of partial multiplications, additions, and reorder operations to generate the correct 32-bit product. All neural operations according to Eq. 1 to 7 were implemented on the five selected multimedia units, either in assembly language (on Intel and AMD processors) or with a C language interface (on Sun and Motorola processors). For reference, all operations were also implemented in C using the float data type. The Gnu C compiler and assembler were used for Intel and AMD processors, the Sun Workshop 5.0 C Compiler for Sun, and the Metrowerks CodeWarrior for Motorola's PowerPC. Compiler optimizations were switched on. As hardware platforms, standard PCs with either a 500 MHz Pentium III or a 700 MHz Athlon, a Sun workstation with a 400 MHz Ultra II processor, and a Macintosh with a 500 MHz G4 PowerPC processor were used. The execution time of all seven neural operations and the total RBF network simulation time for the three different network sizes 16-104-16, 64-416-64, and 128-832-128 were measured on the multimedia and the floating-point units of all processors.

Table 1. Characteristics and selected instructions of several multimedia units (used abbreviations: sat = saturated, m = modulo 2^n, r = rounded, h = only higher result bits, l = only lower result bits, a = arbitrary result bits)

                   MMX          SSE           3DNow!    VIS          AltiVec
                   (Intel)      (Intel)       (AMD)     (Sun)        (Motorola)
available in       Pentium II   Pentium III   K6-2      Ultra I/II   PowerPC G4
register width     64           128           64        64           128
data types         8,16,32 bit  float         float     16,32 bit    8,16,32 bit, float
parallelism        2-8          4             2         2-4          4-16
no. of instr.      57           70            45        85           162
latency (cycles)   1-3          1-5           2         1-3          1-4

(The table further lists which of the following SIMD instructions each unit supports: 16 ± 16, 32 ± 32, float ± float, 8 × 16, 16 × 16, float × float, 16 ± 16 × 16, float ± float × float, 16 × 16 + 16 × 16, reduction by pack (32 → 16), merge/unpack, and permutation.)
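As a reference point, the scalar (non-SIMD) form of Eq. 1 to 7 can be sketched in Python; the network size and learning rates below are illustrative only, not the paper's benchmark settings:

```python
import math
import random

def rbf_step(u, t, c, s, w, eta_c=0.01, eta_s=0.01, eta_w=0.01):
    """One forward pass and gradient-descent update, following Eq. (1)-(7)."""
    n, h, m = len(u), len(s), len(t)
    x = [sum((u[i] - c[i][j]) ** 2 for i in range(n)) for j in range(h)]   # (1)
    y = [math.exp(-x[j] * s[j]) for j in range(h)]                         # (2)
    z = [sum(y[j] * w[j][k] for j in range(h)) for k in range(m)]          # (3)
    d_out = [t[k] - z[k] for k in range(m)]                                # (3)
    d_hid = [sum(d_out[k] * w[j][k] for k in range(m)) for j in range(h)]  # (4)
    for j in range(h):
        sj_old = s[j]
        s[j] -= eta_s * x[j] * y[j] * d_hid[j]                             # (5)
        for k in range(m):
            w[j][k] += eta_w * y[j] * d_out[k]                             # (6)
        for i in range(n):
            c[i][j] += eta_c * (u[i] - c[i][j]) * d_hid[j] * y[j] * sj_old # (7)
    return z

random.seed(0)
n, h, m = 2, 3, 1
c = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(n)]
s = [1.0] * h
w = [[0.0] * m for _ in range(h)]
u, t = [0.5, -0.5], [1.0]
z0 = rbf_step(u, t, c, s, w)     # output before any training: 0.0
for _ in range(200):
    z = rbf_step(u, t, c, s, w)  # output error shrinks toward the target
```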
3
Analysis of Precision Requirements
To exploit the fast SIMD integer operations on MMX, VIS, and AltiVec, all neural network variables must be encoded as fixed-point numbers and mapped onto 8-bit or 16-bit integer data elements. However, Vollmer et al. have demonstrated that training an RBF network on a complex pattern recognition task requires more than 20 bits of precision [8]. In their experimental study, all intermediate results were computed with high precision and then truncated to a lower precision according to the selected data size. To also analyze the implication of rounding, which is supported by a few SIMD units, an RBF network was trained by gradient descent on a classification task. The precision of all variables was varied, and both the mean squared error and the classification rate were determined experimentally on the training set (memorization) and the test set (generalization). Figure 1 illustrates that with truncation, especially the weight update steps (Eq. 5, 6 and 7) require a high precision of 18 to 20 bits, because the quantization error is biased and accumulates over many training epochs. Rounding drastically reduces the precision demands: 11 to 14 bits are sufficient to achieve results comparable to those of the reference implementation.
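The bias effect can be reproduced with a small simulation (illustrative only, using 8 fractional bits rather than the paper's benchmark): truncation loses on average half an LSB per update, so the error accumulates, while round-to-nearest stays unbiased:

```python
import math
import random

def quantize(v, frac_bits=8, rounded=True):
    # Store v in fixed point with frac_bits fractional bits
    scale = 1 << frac_bits
    q = round(v * scale) if rounded else math.floor(v * scale)
    return q / scale

random.seed(1)
updates = [random.uniform(-0.01, 0.01) for _ in range(5000)]
exact = sum(updates)                                 # float reference

w_trunc = w_round = 0.0
for dw in updates:
    w_trunc = quantize(w_trunc + dw, rounded=False)  # biased: drifts downward
    w_round = quantize(w_round + dw, rounded=True)   # unbiased: tracks the reference

bias_trunc = exact - w_trunc   # grows roughly like 5000 * LSB/2
bias_round = exact - w_round   # remains a small random walk
```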
Accelerating RBF Network Simulation
[Fig. 1. Precision requirements of the RBF network variables c_ij and w_jk: mean squared error of classification with RBF after 100 epochs, plotted against the precision of c and of w (8 to 20 bits, plus the float reference), for rounded and truncated arithmetic on both the memorization and the generalization set.]
Thus, a precision of 16 bits was selected for the SIMD-parallel integer implementation of all neural operations on MMX, VIS, and AltiVec. Rounded arithmetic operations were always preferred, although they are not available in all instruction sets (compare Table 1): AltiVec offers a rounding option for the result of all multiplications, whereas MMX supports rounding only in some integer arithmetic instructions added by AMD's 3DNow! extension. Besides rounding, saturation is also required for a correct fixed-point implementation of neural network operations. Fortunately, most multimedia units (apart from Sun's VIS) support at least saturated additions/subtractions and allow the saturated extraction of 16-bit words out of 32-bit results by special pack instructions. Altogether, only the fixed-point implementations on Motorola's AltiVec and on the MMX unit of AMD processors promise acceptable precision for a high-quality RBF implementation including the training phase. Nevertheless, all neural operations were evaluated on all multimedia units, because in some cases (e.g., if only the recognition phase is required or if a lower quality is sufficient) the implementations on the other units may also be justified.
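The saturated and rounded primitives relied on here can be sketched in scalar form. The helper names and the Q15 scaling convention below are our own illustrative choices; the real SIMD instructions apply these operations to several packed elements at once.

```python
INT16_MIN, INT16_MAX = -0x8000, 0x7FFF

def sat16(x: int) -> int:
    """Clamp a result into the signed 16-bit range (saturating arithmetic)."""
    return max(INT16_MIN, min(INT16_MAX, x))

def add16_sat(a: int, b: int) -> int:
    """Saturated 16-bit addition: overflow sticks at the range limits."""
    return sat16(a + b)

def mul16_high_round(a: int, b: int) -> int:
    """16 x 16 bit multiply keeping the rounded high half of the product,
    interpreting the operands as Q15 fixed-point fractions."""
    prod = a * b                            # full 32-bit product
    return sat16((prod + (1 << 14)) >> 15)  # rescale to Q15 with rounding

# Without saturation, 30000 + 10000 would wrap around to a negative value.
print(add16_sat(30000, 10000))
print(mul16_high_round(16384, 16384))  # 0.5 * 0.5 = 0.25 in Q15
```

Saturation replaces the wrap-around of modulo arithmetic by clamping, which is the behavior a fixed-point neural simulation needs when sums of products exceed the representable range.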
4 Analysis of Performance
Fig. 2 shows the total time on all SIMD units for calculating the RBF network output and adapting all parameters after the presentation of a new input vector u. It is evident that the SIMD fixed-point implementations are always faster than their SIMD floating-point counterparts. To study in more detail the suitability of the five multimedia units for accelerating certain neural operations, the speedup of each single neural operation according to Eqs. 1 to 7 over the reference implementation was also calculated. The first line of Table 2 lists the parallelism degree p, which represents a kind of theoretical speedup. It can be seen that for certain operations the measured speedup far exceeds this theoretical value. On MMX the computation of x_j and z_k required in the recognition phase is 6.3 to 9.8 times faster than the reference float implementation. Also for AMD's
Alfred Strey and Martin Bange
[Fig. 2. Total RBF network simulation time for the three different network sizes (16-104-16 in µs; 64-416-64 and 128-832-128 in ms) on the Intel, AMD, Motorola, and Sun processors, comparing the SIMD-parallel integer, SIMD-parallel float, and sequential float implementations of MMX, SSE, 3DNow!, AltiVec (fixed point and float), and VIS.]
Table 2. Measured speedup for seven neural operations on five multimedia units (with the three network sizes 16-104-16 / 64-416-64 / 128-832-128)
unit                             p     x_j            z_k           δ_k           δ_j           s_j           w_jk          c_ij
MMX (Intel)                      4     6.7/6.3/6.9    6.6/7.4/9.8   1.7/2.0/2.3   2.6/2.7/4.5   2.1/2.3/2.4   5.7/3.7/5.4   2.9/2.5/3.2
SSE (Intel)                      4     3.7/3.5/3.2    4.7/4.2/3.5   1.3/1.6/1.8   3.7/3.6/3.4   2.1/2.3/2.3   3.7/2.0/2.0   3.3/1.5/1.1
3DNow! (AMD)                     2     5.6/3.6/3.2    5.6/4.6/2.4   1.1/1.0/1.0   4.7/4.1/3.0   0.9/1.0/1.0   3.3/1.9/1.5   2.3/1.6/1.4
AltiVec (Motorola, fixed)        8     6.7/8.6/10.0   7.9/6.6/6.8   3.0/5.0/5.5   5.3/4.3/4.5   4.1/4.3/4.5   7.8/6.9/7.5   3.6/4.5/5.5
AltiVec (Motorola, float)        4     3.0/3.5/3.9    3.9/3.0/3.1   3.0/2.5/2.7   3.5/2.8/2.8   2.7/2.7/2.7   3.8/3.4/3.5   2.7/2.6/3.0
VIS (Sun)                        2-4   2.1/2.4/2.4    1.7/1.7/1.7   1.8/2.2/2.4   2.0/2.0/2.0   0.8/0.8/0.8   1.5/1.7/1.8   2.5/3.2/3.2
3DNow!, which offers only a parallelism degree of p = 2, the measured speedup of many operations (e.g., the computation of x_j, z_k, or δ_j for small networks) is surprisingly high (up to 5.6 times faster than the float implementation). This anomaly can be explained by special SIMD instructions (such as multiply-add or vector reduction, compare Table 1) that replace more than p sequential float instructions, and by the shorter latencies of SIMD arithmetic instructions (compared to the sequential instructions on the corresponding processor core). On none of the units can more than half the theoretical speedup be achieved for the δ_k and s_j calculations. This may be due to the simplicity of Eqs. 3 and 5: only one operation is performed with each element loaded from memory. For the remaining three neural operations (the calculation of w_jk, δ_j, and c_ij) the performance is only average. Here the theoretical speedup cannot be reached because a large number of reorder steps (e.g., for the replication of scalar operands) are necessary. In case
of the w_jk and c_ij calculations, the slow 16 x 16 bit fixed-point multiplications of some SIMD units additionally reduce the performance. Two effects appear when the network size is varied. On the one hand, the speedup of some fixed-point implementations increases when the network is enlarged (e.g., for the calculation of z_k and δ_j on MMX, or of x_j and c_ij on AltiVec). On the other hand, the floating-point implementations of several operations (e.g., x_j and z_k on SSE, or most operations on 3DNow!) show the reverse effect. This behavior results from cache effects and is analyzed in [5].
5 Results and Conclusions
This case study demonstrates that the multimedia units of modern microprocessors are fairly well suited for accelerating neural network simulations. A high speedup in the range from 1.9 (on Sun's VIS with small networks) to 6.6 (for the fixed-point realization on Motorola's AltiVec with large networks) can be achieved for the simulation of a complete RBF training step. Furthermore, applications that require only the fast recognition of presented patterns can be accelerated by a factor of approximately 10 on some multimedia units. The SIMD instruction sets turned out not to be optimal for neural network simulation. By introducing a few additional instructions (such as more general 16 x 16 bit multiplications, faster replication of scalar operands in registers, or more powerful vector reductions) the speedup could be increased further. The missing ability to round low-precision results also restricts the applicability of some multimedia units to very simple pattern recognition tasks. Unfortunately, the neural operations must be encoded on all multimedia units in low-level languages. As long as high-level compiler support is missing, the high programming effort can be justified only for time-critical applications.
References
1. K. Diefendorff, P.K. Dubey, R. Hochsprung, and H. Scales. AltiVec extension to PowerPC accelerates media processing. IEEE Micro, 20(2):85–95, 2000.
2. L. Gaborit, B. Granado, and P. Garda. Evaluating microprocessors' multimedia extensions for the real time simulation of RBF networks. In Proceedings of MicroNeuro, pages 217–221. IEEE, 1999.
3. Stuart Oberman, Greg Favor, and Fred Weber. AMD 3DNow! technology: Architecture and implementations. IEEE Micro, 19(2):37–48, 1999.
4. A. Peleg, S. Wilkie, and U. Weiser. Intel MMX for multimedia PCs. Communications of the ACM, 40(1):24–38, 1997.
5. A. Strey and M. Bange. Performance analysis of Intel's MMX and SSE: A case study. In Proceedings of Euro-Par 2001, Manchester, August 28–31, 2001.
6. S.T. Thakkar and T. Huff. Internet streaming SIMD extensions. Computer, 32(12):26–34, December 1999.
7. M. Tremblay, J.M. O'Connor, V. Narayanan, and L. He. VIS speeds new media processing. IEEE Micro, 16(4):10–20, 1996.
8. U. Vollmer and A. Strey. Experimental study on the precision requirements of RBF, RPROP and BPTT training. In Proceedings of ICANN 99, pages 239–244. IEE Conference Publication No. 470, 1999.
A Game-Theoretic Adaptive Categorization Mechanism for ART-Type Networks
Waikeung Fung and Yunhui Liu
Department of Automation and Computer-Aided Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
{wkfung,yhliu}@acae.cuhk.edu.hk
Abstract. A game-theoretic formulation of an adaptive categorization mechanism for ART-type networks is proposed in this paper. We derive the game-theoretic model Γ_AC for the competitive categorization processes of ART-type networks and an update rule for the vigilance parameters using the concept of learning automata. As demonstrated in the experiments provided, the numbers of clusters generated by ART adaptive categorization are similar regardless of the initial vigilance parameter ρ assigned to the ART network. The proposed adaptive categorization mechanism can thus avoid the problem of choosing a suitable vigilance parameter a priori for pattern categorization.
1 Introduction
ART-type (Adaptive Resonance Theory) networks [1][2] are a class of categorization neural networks well known for their incremental categorization capability. The granularity of the clusters generated by ART-type networks is controlled by a fixed scalar called the vigilance parameter ρ. This paper incorporates adaptive categorization (variable-sized clustering) into ART-type networks by adjusting the vigilance parameter ρ. Few approaches have been proposed for changing the vigilance parameter of ART-type networks, and existing methods just blindly increase it by a fixed amount when all committed F2 neurons are exhausted [3]. This approach eventually sets the vigilance parameter to 1, so that any new pattern forms its own cluster. To solve this problem, we propose a game-theoretic formulation of the adaptive vigilance parameter strategy in ART-type networks [4] with the help of learning automata theory [5]. The proposed ρ-adaptation scheme can easily be added on top of the original design of ART-type networks. The game-theoretic vigilance adaptation strategy improves the clustering performance of ART-type networks with respect to category number stability, regardless of the pre-specified initial vigilance parameter. It therefore avoids the trial-and-error approach of choosing a suitable vigilance parameter a priori for data categorization [4].
This work is supported in part by the Hong Kong Research Grant Council under grant CUHK 4151/97E
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 170–176, 2001. © Springer-Verlag Berlin Heidelberg 2001
ART Adaptive Categorization
This paper is organized as follows. Section 2 describes the mathematical formulation of the adaptive categorization game Γ_AC and the derived vigilance parameter update rule. Section 3 presents categorization experiment results with and without the proposed adaptive categorization mechanism. In addition, a summary of the paper is given in Section 4.
2 Adaptive Categorization in ART-Type Networks
Original ART-type networks use only a single, fixed vigilance parameter for all clusters. Fixed-size clusters can hardly represent the data subspace thoroughly, and misclassification often happens in categorization-based classification. On the other hand, variable-size clusters, generated by an adaptive vigilance parameter mechanism, can approximate the data pattern subspace well and even render decision boundaries, so that misclassification can be avoided. Moreover, adaptive categorization helps prevent misclassification of data patterns from disjoint distributions. This paper presents a mathematical formulation of a game-theoretic vigilance parameter adaptation mechanism for adaptive categorization in ART-type networks. Each F2 neuron has its individual vigilance parameter ρ_i, together with an update rule for adaptively adjusting it during the categorization process.

2.1 Game-Theoretic Formulation
The competitive clustering mechanism in ART-type networks is formulated as an infinite n-person non-cooperative game, characterized by the index set P = {1, 2, ..., n} of F2 neurons (the players of the game Γ_AC), the strategy set R^(i) of the i-th player, and the payoff function π^(i) of the i-th player. The vigilance parameter of each F2 neuron forms its strategy ρ_i ∈ R^(i), and ρ_i is usually bounded in [0, 1]. The vigilance parameter adaptation strategy will be derived from the Nash equilibrium of this game-theoretic formulation. Each F2 neuron (player of the game Γ_AC) must attend two independent tests for each pattern presentation: the matching score test (MST) and the vigilance test (VT) [4]. When a data pattern is presented to an ART network, the F2 neurons can be classified into three groups according to the three possible states Σ = {R, r, f} after the presentation (see Fig. 1):
RESONANCE (R): Only one F2 neuron can be in the resonance state, as it has passed both the MST and the VT. The presented data pattern is assigned to the category represented by this neuron.
RESET (r): F2 neurons in the reset state have passed the MST but failed the VT. Denote the number of F2 neurons in the reset state by k.
FAIL (f): F2 neurons in the fail state have failed both the MST and the VT. There are (n − k − 1) F2 neurons in this state.
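The three-way state assignment can be written down directly. A minimal sketch; the boolean inputs `passed_mst` and `passed_vt` are hypothetical stand-ins for the actual matching score and vigilance computations:

```python
def assign_states(passed_mst, passed_vt):
    """Assign each F2 neuron one of the states 'R', 'r', or 'f'.

    At most one neuron (a winner passing both tests) enters RESONANCE;
    neurons passing only the MST go to RESET, the rest to FAIL.
    """
    states = []
    resonance_taken = False
    for mst, vt in zip(passed_mst, passed_vt):
        if mst and vt and not resonance_taken:
            states.append("R")      # wins the category
            resonance_taken = True
        elif mst:
            states.append("r")      # passed the MST, failed the VT
        else:
            states.append("f")      # failed the tests
    return states

states = assign_states([True, True, False, True], [True, False, False, False])
print(states)  # one 'R', k = 2 neurons in 'r', n - k - 1 = 1 in 'f'
```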
In our algorithm, only F2 neurons in the states RESONANCE or RESET may have their vigilance parameters updated for the next pattern. If a pattern is categorized with direct access (no F2 neuron in the RESET state), no update of the vigilance parameters is conducted. The basic derivation of the adaptive categorization game Γ_AC follows the Cournot game for the oligopoly market model [6]. Each F2 neuron incurs costs when it attends the MST and the VT and acquires rewards if it passes the tests. In the MST, the cost incurred and the reward received by an F2 neuron depend only on the matching score µ_i of that neuron; in the VT they depend only on its vigilance parameter ρ_i. The costs $c_{MST}^{(i)}$ of the MST and $c_{VT}^{(i)}$ of the VT incurred by the i-th F2 neuron are linear in µ_i and ρ_i, respectively:
$$c_{MST}^{(i)} = \alpha_{MST} + \beta_{MST}\,\mu_i \quad\text{and}\quad c_{VT}^{(i)} = \alpha_{VT} - \beta_{VT}^{(i)}\,\rho_i \qquad (1)$$
where $\alpha_{MST}$, $\beta_{MST}$, $\alpha_{VT}$, and $\beta_{VT}^{(i)}$ are positive constants. The cost $c_{VT}^{(i)}$ increases with decreasing ρ_i, as the F2 neurons are encouraged to cluster finely (high ρ) for a better approximation of the data subspace. The rewards $r_{MST}^{(i)}$ of the MST and $r_{VT}^{(i)}$ of the VT obtained by the i-th F2 neuron are given as
$$r_{MST}^{(i)} = \mu_i \Big( (n-1)\,\mu_i - \sum_{j \neq i} \mu_j \Big) \quad\text{and}\quad r_{VT}^{(i)} = \rho_i \Big( \sum_{j \in I(t),\, j \neq i} \rho_j - k\,\rho_i \Big) \qquad (2)$$
where n is the total number of F2 neurons involved in categorization and k is the number of F2 neurons in the RESET state.
where n is the total number of F2 neurons involved in categorization and k is the number of F2 neurons in RESET state. {xR , xr , xf } p
(i) RR
Environment
p(i) fR Fail (f)
p
(i) ff
(i) pRf
(i) prR
(i) prf
pf(i)r
Data Patterns
Resonance (R) (i) pRr
Reset (r)
θ (i) ∈ {−1, 0, 1}
Adaptative Categorization Game (i) ΓAC
(i) rr
p
Fig. 1. State Transitions of a F2 neuron.
s(i) (t) ∈ Σ
Learning Automaton
L(i) AC
(i) Fig. 2. ΓAC and the environment.
Each F2 neuron will try its best to win the MST while spending as little effort (ρ_i) as possible to win the VT. The net gains of an F2 neuron in state σ_i ∈ Σ are then given as
$$\pi_R^{(i)} = (r_{VT}^{(i)} - c_{VT}^{(i)}) + (r_{MST}^{(i)} - c_{MST}^{(i)}), \quad \pi_r^{(i)} = -c_{VT}^{(i)} + (r_{MST}^{(i)} - c_{MST}^{(i)}), \quad \pi_f^{(i)} = -c_{MST}^{(i)}.$$
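Eqs. (1), (2), and the net gains above combine into a single payoff computation. A sketch with illustrative constants; the α, β values and the example scores below are our assumptions, not taken from the paper:

```python
def costs(mu_i, rho_i, a_mst=0.1, b_mst=0.2, a_vt=0.3, b_vt=0.2):
    """Linear test costs of Eq. (1); the alpha/beta constants are illustrative."""
    return a_mst + b_mst * mu_i, a_vt - b_vt * rho_i

def rewards(i, mu, rho, active):
    """Test rewards of Eq. (2); `active` is the index set I(t), k = |I(t)| - 1."""
    n, k = len(mu), len(active) - 1
    r_mst = mu[i] * ((n - 1) * mu[i] - sum(m for j, m in enumerate(mu) if j != i))
    r_vt = rho[i] * (sum(rho[j] for j in active if j != i) - k * rho[i])
    return r_mst, r_vt

def net_gains(c_mst, c_vt, r_mst, r_vt):
    """Net gains of an F2 neuron in each of the three states."""
    return {"R": (r_vt - c_vt) + (r_mst - c_mst),
            "r": -c_vt + (r_mst - c_mst),
            "f": -c_mst}

mu, rho, active = [0.9, 0.6, 0.3], [0.5, 0.7, 0.4], [0, 1]
c_mst, c_vt = costs(mu[0], rho[0])
r_mst, r_vt = rewards(0, mu, rho, active)
gains = net_gains(c_mst, c_vt, r_mst, r_vt)
print(gains)
```

For these example values the gains satisfy π_R > π_r > π_f, i.e., resonating is worth the vigilance effort here.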
Denote the state of the i-th F2 neuron at the t-th pattern presentation by s^(i)(t), and let I(t) ⊂ P be the index set of F2 neurons in the states RESONANCE or RESET after the t-th pattern presentation. The payoff function π^(i) of the i-th (i ∈ I(t)) F2 neuron at the t-th pattern presentation is then defined as the expected gain of that neuron over the three possible states,
$$\pi^{(i)}(t) = \sum_{\sigma_u \in \Sigma} \mathrm{Prob}\big(s^{(i)}(t) = \sigma_u\big)\, \pi_u^{(i)}(t),$$
where Prob(·) denotes the probability of the given outcome.

2.2 State Probability Dynamics of the F2 Neurons

A learning automaton L_AC^(i), i = 1, 2, ..., n, is constructed for each F2 neuron to track the variations of the state probabilities over time [5]. Each L_AC^(i) in the adaptive categorization game Γ_AC consists of a set of internal states Σ = {R, r, f}¹, a set of input reinforcement signals θ^(i) ∈ Θ = {−1, 0, 1}, a state transition probability matrix P ∈ IR^{3×3} that governs the state transitions of L_AC^(i) between any two consecutive time instants, and a reinforcement scheme for the state probability update. Fig. 2 depicts the interactions between the game Γ_AC (or its learning automaton) and the environment. The environment is assumed to generate the data patterns that the ART-type network categorizes, together with the reinforcement signals θ^(i). The purpose of the reinforcement signal, which depends on the current state s^(i)(t) ∈ Σ of L_AC^(i), is to guide the state probability adjustment according to how well the learning automaton tracks its environment. The reinforcement scheme of the learning automaton L_AC^(i) of the i-th F2 neuron provides the update rule for the state probabilities. The derivation is as follows. The confirmatory transition probability $q_{vu}^{(i)} = \mathrm{Prob}\big(s^{(i)}(t) = \sigma_u \mid s^{(i)}(t+1) = \sigma_v\big)$ can easily be deduced from Bayes' theorem. The reinforcement signal θ^(i), which reflects the tracking performance of the learning automaton in its environment, is defined in terms of the state transition probabilities p_uv^(i) and q_vu^(i):
$$\theta^{(i)} = \begin{cases} -1 & \text{if } s^{(i)} = \arg\min_{\sigma_u \in \Sigma} \sum_{\sigma_v \in \Sigma} p_{uv}^{(i)} q_{vu}^{(i)} \\ 1 & \text{if } s^{(i)} = \arg\max_{\sigma_u \in \Sigma} \sum_{\sigma_v \in \Sigma} p_{uv}^{(i)} q_{vu}^{(i)} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
The sum-and-product term $\sum_{\sigma_v \in \Sigma} p_{uv}^{(i)} q_{vu}^{(i)}$ measures the amount of evidence supporting the transition from state σ_u at the t-th pattern presentation to state σ_v at the (t+1)-th pattern presentation. The reinforcement scheme for the learning automaton L_AC^(i) is proposed as a linear reward-penalty function pair, listed as follows:
$$\xi_u^{(i)}(t+1) = \begin{cases} \xi_u^{(i)}(t) + \tfrac{1}{2} a (1+\theta^{(i)}) \big(1 - \xi_u^{(i)}(t)\big) - \tfrac{1}{2} b (1-\theta^{(i)})\, \xi_u^{(i)}(t) & \text{if } s^{(i)}(t) = \sigma_u \\ \xi_u^{(i)}(t) - \tfrac{1}{2} a (1+\theta^{(i)})\, \xi_u^{(i)}(t) + \tfrac{1}{2} b (1-\theta^{(i)}) \big(\tfrac{1}{2} - \xi_u^{(i)}(t)\big) & \text{if } s^{(i)}(t) \neq \sigma_u \end{cases}$$
¹ The state of an F2 neuron is reflected in the state of the learning automaton associated with it.
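The scheme above is a standard linear reward-penalty update over the three states. A minimal sketch using the learning rates a = 0.75 and b = 0.1 adopted in Section 3 (the dictionary representation of ξ is our choice):

```python
def update_state_probs(xi, current, theta, a=0.75, b=0.1):
    """Linear reward-penalty update of the state probabilities xi = {state: prob}.

    `current` is the automaton's present state and theta in {-1, 0, 1} the
    reinforcement signal; the probabilities stay normalized to 1.
    """
    reward = 0.5 * a * (1 + theta)
    penalty = 0.5 * b * (1 - theta)
    new = {}
    for s, p in xi.items():
        if s == current:
            new[s] = p + reward * (1 - p) - penalty * p
        else:
            # The factor 1/2 = 1/(|states| - 1) spreads probability over
            # the two remaining states.
            new[s] = p - reward * p + penalty * (0.5 - p)
    return new

xi = {"R": 0.2, "r": 0.5, "f": 0.3}
xi = update_state_probs(xi, current="R", theta=1)
print(xi, sum(xi.values()))
```

The update redistributes probability mass toward (or away from) the current state depending on θ while keeping the three probabilities summing to 1.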
2.3 ρ-Adaptation: Nash Equilibrium of Γ_AC
Vigilance parameters are adapted at the Nash equilibrium of the game Γ_AC. The Nash equilibrium ρ* of Γ_AC is defined as a strategy profile that satisfies the best response functions of all players (F2 neurons), i.e., that gives the best reply to the strategies ρ_{−i} = ρ \ ρ_i of the other F2 neurons [6]. The best response function of the i-th F2 neuron is obtained by setting ∂π^(i)/∂ρ_i = 0. Therefore, the Nash equilibria of Γ_AC are given as pairs (ρ*, β_VT) that satisfy the equation Ψρ* = Ωβ_VT, i.e.,
$$\begin{pmatrix} 2k & -1 & \cdots & -1 \\ -1 & 2k & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & 2k \end{pmatrix} \begin{pmatrix} \rho_{I_1}^* \\ \rho_{I_2}^* \\ \vdots \\ \rho_{I_{k+1}}^* \end{pmatrix} = \begin{pmatrix} \frac{1+r^{(I_1)}}{\xi_R^{(I_1)}}\, \beta_{VT}^{(I_1)} \\ \vdots \\ \frac{1+r^{(I_{k+1})}}{\xi_R^{(I_{k+1})}}\, \beta_{VT}^{(I_{k+1})} \end{pmatrix} \qquad (4)$$
with Ψ ∈ IR^{(k+1)×(k+1)}, ρ* ∈ IR^{k+1}, Ω ∈ IR^{(k+1)×(k+1)}, and β_VT ∈ IR^{k+1}, where I_j is the j-th element of the index set I(t). The "kinetic" energy of the i-th F2 neuron equals in magnitude its payoff at the Nash equilibrium, $K_i = \pi_{NE}^{(i)} = k\,\xi_R^{(i)} (\rho_i^*)^2 + A_i$. Every F2 neuron is eager to gain as much payoff $\pi_{NE}^{(i)}$ as possible in the competition for the RESONANCE state by tuning its ρ_i* towards 1 during categorization. However, this is not economical: the total energy supplied by all F2 neurons in the categorization process is then not minimized, and all F2 neurons would eventually drive their vigilance parameters to the extreme values². Vigilance parameters are instead adapted so that minimum energy is consumed by the F2 neurons to overcome the potential barrier of becoming the winning F2 neuron during the categorization of data patterns. The potential barrier P_i that keeps the i-th F2 neuron from becoming a winning neuron (i.e., from entering the RESONANCE state)
is $P_i = (1 - \xi_R^{(i)})\,\rho_i^*$. Intuitively, the potential barrier increases with increasing vigilance parameter, and the state probability $\xi_R^{(i)}$ indicates how easily the F2 neuron can overcome the barrier; $\xi_R^{(i)}$ introduces an inhibitory effect on the RESONANCE-state potential barrier. The difference between K_i and P_i is minimized with respect to the vigilance parameters ρ_i*, i ∈ I(t), so that the F2 neurons consume minimal energy to overcome the potential barriers at the next pattern presentation. Defining the Lagrangian $L_i = K_i - P_i = k\,\xi_R^{(i)} (\rho_i^*)^2 - (1 - \xi_R^{(i)})\,\rho_i^* + A_i$ for each F2 neuron, i ∈ I(t), the updated vigilance parameter ρ_i* is obtained by setting $\partial L_i / \partial \rho_i^* = 0$. Then the vigilance parameter of the i-th F2 neuron at the t-th pattern presentation is updated by
² This argument is analogous to the "principle of least action" hypothesis proposed by Pierre-Louis Moreau de Maupertuis (1698–1759) in the field of analytical dynamics [7].
$$\rho_i^*(t) = \begin{cases} \dfrac{1 - \xi_R^{(i)}}{2k\,\xi_R^{(i)}} & \text{if } \xi_R^{(i)} > \dfrac{1}{2k+1} \\[4pt] \rho_i^*(t-1) & \text{otherwise} \end{cases} \qquad \text{where } i \in I(t).$$
The condition imposed in the update law restricts each ρ_i* to its nominal range (0, 1).
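The resulting update law transcribes directly into code (the variable names are ours):

```python
def update_vigilance(rho_prev, xi_R, k):
    """Vigilance update at the Nash equilibrium of Gamma_AC.

    rho_prev: previous vigilance rho_i*(t-1); xi_R: RESONANCE-state
    probability of the neuron; k: number of F2 neurons in the RESET state.
    The guard xi_R > 1/(2k+1) keeps the new rho inside (0, 1).
    """
    if xi_R > 1.0 / (2 * k + 1):
        return (1.0 - xi_R) / (2 * k * xi_R)
    return rho_prev

print(update_vigilance(0.6, xi_R=0.4, k=2))  # condition holds: update fires
print(update_vigilance(0.6, xi_R=0.1, k=2))  # condition fails: keep old rho
```

Note how the guard enforces the nominal range: xi_R > 1/(2k+1) is exactly the condition under which (1 − xi_R)/(2k·xi_R) < 1.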
3 Simulations
Simulation results are presented to compare the performance of Fuzzy ART networks [2] with and without the proposed game-theoretic ρ-adaptation. The learning rates in the reinforcement scheme are set to a = 0.75 and b = 0.1. 2000 uniformly distributed random two-dimensional data patterns, confined to a pair of disjoint distributions, are generated for the simulations. Fuzzy ART networks with and without ρ-adaptation are tested with starting vigilance parameters of 0.4, 0.55, 0.7, and 0.85 and a Fuzzy ART learning rate of 0.9. Fig. 3 depicts the categorization results. The numbers of categories formed without ρ-adaptation are 4, 8, 14, and 43 for starting ρ of 0.40, 0.55, 0.70, and 0.85, respectively. With ρ-adaptation, the numbers of categories formed are 74, 71, 73, and 73, respectively.
Initial rho=0.55
Initial rho=0.70
Initial rho=0.85
1
1
1
1
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0
0
0.2
0.4
0.6
0.8
1
0
0
(a) ρ = 0.40
0.2
0.4
0.6
0.8
1
0
0
(b) ρ = 0.55
Initial rho=0.40
0.2
0.4
0.6
0.8
1
0
(c) ρ = 0.70
Initial rho=0.55
Initial rho=0.70 1
1
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.4
0.6
(e) ρ = 0.40
0.8
1
0
0
0.2
0.4
0.6
(f) ρ = 0.55
0.8
1
0
0
0.2
0.4
0.6
(g) ρ = 0.70
0.4
0.6
0.8
1
0.8
1
Initial rho=0.85
1
0
0.2
(d) ρ = 0.85
1
0
0
0.8
1
0
0
0.2
0.4
0.6
(h) ρ = 0.85
Fig. 3. Categorization experiments by Fuzzy ART networks with (lower row) and without (upper row) ρadaptation.
As shown in Fig. 3, the Fuzzy ART prototypes [2] are displayed as rectangles. The prototype rectangles generated by Fuzzy ART without ρ-adaptation render the pattern distribution boundary poorly, especially for lower fixed vigilance parameters. In contrast, the prototype rectangles generated by Fuzzy ART with ρ-adaptation render the pattern distribution boundary well, no matter what the starting vigilance parameter is. Moreover, the number of categories generated by Fuzzy ART without ρ-adaptation grows geometrically with increasing starting vigilance parameter, while the number of categories generated with ρ-adaptation is much less sensitive to the starting vigilance parameter chosen. Thus ρ-adaptation remedies the difficulty of choosing vigilance parameters a priori for data clustering with ART-type networks. The categories generated with ρ-adaptation cover far fewer patterns from both of the disjoint distributions at once than those generated by the conventional Fuzzy ART network, as shown in Fig. 3. When the starting vigilance parameter is 0.85, the categories generated by the ρ-adaptive Fuzzy ART network can even be divided into two distinct groups, each containing patterns from one and only one distribution. This helps avoid misclassification in classifier systems constructed from Fuzzy ART networks.
4 Summary
This paper proposed a mathematical formulation of adaptive categorization for ART-type networks based on game theory. We derived the game-theoretic model Γ_AC for the competitive clustering processes of ART-type networks and an update rule for the vigilance parameters using the concept of learning automata. Categorization experiments demonstrated that the game-theoretic vigilance parameter adaptation improves the clustering performance of ART networks with respect to category number stability and avoids the problem of choosing a suitable vigilance parameter a priori for pattern categorization. Moreover, the coverage of the clusters generated by ART networks with ρ-adaptation reflects the shape of the pattern distribution and thus prevents misclassification in categorization-based classification.
References
1. G. A. Carpenter and S. Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics and Image Processing, 37:54–115, 1987.
2. G. A. Carpenter, S. Grossberg, and D. B. Rosen. Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4:759–771, 1991.
3. N. Vlajic and H. C. Card. Categorizing Web pages using modified ART. In 1998 IEEE Canadian Conference on Electrical and Computer Engineering, volume 1, pages 313–316, 1998.
4. W. K. Fung and Y. H. Liu. A game-theoretic formulation on adaptive categorization in ART networks. In Proceedings of the 1999 International Joint Conference on Neural Networks IJCNN'99, volume 2, pages 1081–1086, 1999.
5. K. S. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction. Prentice Hall, 1989.
6. D. Fudenberg and J. Tirole. Game Theory. The MIT Press, 1991.
7. J. H. Williams, Jr. Fundamentals of Applied Dynamics. John Wiley and Sons, 1996.
Gaussian Radial Basis Functions and Inner-Product Spaces
Irwin W. Sandberg
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712, USA
[email protected]
Abstract. An approximation result is given concerning Gaussian radial basis functions in a general inner-product space. Applications are described concerning the classification of the elements of disjoint sets of signals, and also the approximation of continuous real functions defined on all of IR^n using RBF networks. More specifically, it is shown that an important large class of classification problems involving signals can be solved using a structure consisting of only a generalized RBF network followed by a quantizer. It is also shown that Gaussian radial basis functions defined on IR^n can uniformly approximate arbitrarily well over all of IR^n any continuous real functional f on IR^n that meets the condition that f(x) → 0 as ‖x‖ → ∞.
1 Introduction
Radial basis functions are of interest in connection with a variety of approximation problems in the neural networks area, and in other areas as well. Much is understood about the properties of these functions (see, for instance, [1]–[3]). It is known [2], for example, that arbitrarily good approximation in L¹(IR^n) of a general f ∈ L¹(IR^n) is possible using uniform smoothing factors and radial basis functions generated in a certain natural way from a single g in L¹(IR^n) if and only if g has a nonzero integral. As another example, in [3] it is proved that Gaussian radial basis functions can uniformly approximate arbitrarily well any continuous real functional defined on a compact convex subset of IR^n. Here we give an approximation result concerning Gaussian radial basis functions in a general inner-product space. This result, Theorem 1 in Section 2, has two applications that are felt to be interesting. In particular, we show that Gaussian radial basis functions defined on IR^n can in fact uniformly approximate arbitrarily well over all of IR^n any continuous real functional f on IR^n that meets the condition that $\lim_{\|x\| \to \infty} f(x) = 0$.
This generalizes the result in [3] because, by the Lebesgue-Urysohn extension theorem [4, p. 63] (sometimes attributed to Tietze), any continuous real functional defined on a bounded closed subset of IR^n can be extended so that it is defined and continuous on all of IR^n and meets the above condition.¹
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 177–182, 2001. © Springer-Verlag Berlin Heidelberg 2001
[Fig. 1. Classification network: the signal x to be classified feeds the functionals y_1, ..., y_k, followed by a continuous memoryless nonlinear mapping h and a quantizer Q that produces the classifier output w.]
Our second application concerns the problem of classifying signals. This problem is of interest in several application areas (e.g., signal detection, computer-assisted medical diagnosis, automatic target identification, etc.). Typically we are given a finite number m of pairwise disjoint sets C_1, ..., C_m of signals, and we would like to synthesize a system that maps the elements of each C_j into a real number a_j so that the numbers a_1, ..., a_m are distinct. In [5] (see also [6]) it is shown that the structure shown in Fig. 1 can perform this classification (for any prescribed distinct a_1, ..., a_m) assuming only that the C_j are compact subsets of a real normed linear space. In the figure, x is the signal to be classified; y_1, ..., y_k are functionals that can be taken to be linear²; h denotes a continuous memoryless nonlinear mapping that, for example, can be implemented by a neural network having one hidden nonlinear layer; the block labeled Q is a quantizer that for each j maps numbers in the interval (a_j − 0.25ρ, a_j + 0.25ρ) into a_j, where ρ = min_{i≠j} |a_i − a_j|; and w is the output of the classifier³. Here, in Theorem 3 of Section 2, we show that a similar classification problem can be solved in a very different way using just a generalized radial basis function network and a quantizer of the type in Fig. 1. Theorem 3 establishes in addition that the network structure considered can also classify signals x whose critical property is that the norm of x is sufficiently large. Some additional related results are given in Section 2.
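The classification scheme just described can be imitated numerically with a generalized RBF network followed by a quantizer. A toy sketch: the two Gaussian-blob "signal" sets in IR², the choice of centers, and the least-squares fit of the weights are all our assumptions, not the construction used in the proofs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two disjoint compact sets of "signals" in IR^2, with targets a_1 = 1, a_2 = 2.
C1 = rng.normal(loc=(-2.0, 0.0), scale=0.3, size=(40, 2))
C2 = rng.normal(loc=(+2.0, 0.0), scale=0.3, size=(40, 2))
X = np.vstack([C1, C2])
targets = np.array([1.0] * 40 + [2.0] * 40)

# Generalized RBF network: Gaussian units centered on a subset of the data.
centers = X[::4]
beta = 1.0
Phi = np.exp(-beta * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

# Quantizer Q: map each network output to the nearest target a_j.
net_out = Phi @ alpha
labels = np.where(np.abs(net_out - 1.0) < np.abs(net_out - 2.0), 1.0, 2.0)
print("fraction correctly classified:", (labels == targets).mean())
```

Here ρ = |a_1 − a_2| = 1 and the quantizer simply maps any output to whichever a_j is closer, in the spirit of the block Q in Fig. 1.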
¹ For example, let a continuous f_0 defined on a bounded closed subset A of IR^n be given, and let A be contained in an open ball B centered at the origin of IR^n. Then, since the complement C of B with respect to IR^n is closed (and thus A ∪ C is closed), by the Lebesgue-Urysohn extension theorem there is a continuous extension f of f_0 defined on IR^n such that f(x) = 0, x ∈ C.
² Examples are given in [5] and [6].
³ The proof given in [5],[6] expands on a brief remark in [7, p. 274] that any continuous real functional on a compact subset of a real normed linear space can be uniformly approximated arbitrarily well using only a feedforward neural network with a linear-functional input layer and one memoryless hidden nonlinear (e.g., sigmoidal) layer.
2 Approximation and Classification Using Generalized RBF Structures
2.1 Preliminaries
Let S be a real inner-product space (i.e., a real pre-Hilbert space) with inner product ⟨·, ·⟩ and norm ‖·‖ derived in the usual way from ⟨·, ·⟩. Let X be a metric space whose points are a subset of the points of S, and denote the metric in X by d. With V any nonempty convex subset of S such that {‖· + v‖ : v ∈ V} separates the points of X (i.e., such that for x and y in X with x ≠ y there is a v ∈ V for which ‖x + v‖ ≠ ‖y + v‖), and with P any nonempty subset of (0, ∞) that is closed under addition, let X0 denote the set of functions g defined on X that have the representation

g(x) = α exp{−β‖x − v‖²}    (1)
in which α ∈ IR, β ∈ P, and v ∈ V. Let 𝒳 stand for the set of continuous functions f from X to the reals IR with the property that for each ε > 0 there is a compact subset X_{f,ε} of X such that |f(x) − f(y)| < ε for x, y ∉ X_{f,ε}. And let X∞ denote the family of f's in 𝒳 such that for each f and each ε > 0 there is a compact subset X_{f,ε} of X such that |f(x)| < ε for x ∉ X_{f,ε}.

2.2 Approximation
Our main result concerning approximation is the following.

Theorem 1: Assume that X is locally compact but not compact,⁴ and that X is such that X0 ⊂ X∞. Then for each f ∈ X∞ and each ε > 0 there are a positive integer q, real numbers α1, ..., αq, numbers β1, ..., βq belonging to P, and elements v1, ..., vq of V such that

|f(x) − Σ_{k=1}^{q} αk exp{−βk ‖x − vk‖²}| < ε    (2)
for all x ∈ X.

Proof: All proofs are omitted in this version of the paper.

Comments: Since the condition ‖x − v‖ = ‖y − v‖ is equivalent to the condition that 2⟨x − y, v⟩ = ‖x‖² − ‖y‖², and assuming that x ≠ y in X implies that x ≠ y in S, we see that V can be taken to be, for instance, any convex subset of S that contains the points of an open ball in S. Of course, P can be taken to be (0, ∞) or {1, 2, 3, ...}, etc.

Since exp{−β‖x − v‖²} → 0 as ‖x‖ → ∞ when β > 0, we have X0 ⊂ X∞ when for each γ > 0 there is a compact subset Xγ of X such that ‖x‖ ≥ γ for x ∈ X with x ∉ Xγ. In the following theorem (and as suggested earlier), n is an arbitrary positive integer and IR^n stands for the linear space of real n-vectors with the usual Euclidean norm. This norm is also denoted by ‖·‖.

Theorem 2: Let U be any convex subset of IR^n that contains the points of an open ball in IR^n, and let Q be any set of positive numbers that is closed under addition. Let f be a continuous function from IR^n to IR such that

lim_{‖x‖→∞} f(x) = 0.    (3)

Then for each ε > 0 there are a positive integer q, real numbers α1, ..., αq, numbers β1, ..., βq belonging to Q, and elements v1, ..., vq of U such that

|f(x) − Σ_{k=1}^{q} αk exp{−βk ‖x − vk‖²}| < ε    (4)

for all x ∈ IR^n. As mentioned earlier, Theorem 2 generalizes the result in [3] concerning approximation on compact convex subsets of IR^n.

⁴ X is locally compact if each point of X is interior to some compact subset of X.
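To make the form of the approximant in Theorems 1 and 2 concrete, here is a small numerical sketch. The theorems only assert the existence of a good choice of q, αk, βk, and vk; the fitting-by-least-squares procedure, the particular centers, and the width below are our own illustrative choices, not part of the theorems.

```python
import numpy as np

def design(xs, centers, beta):
    """Phi[i, k] = exp(-beta * (x_i - v_k)^2): Gaussian RBF features in one dimension."""
    X = np.asarray(xs, dtype=float)
    V = np.asarray(centers, dtype=float)
    return np.exp(-beta * (X[:, None] - V[None, :]) ** 2)

def fit_alphas(f, centers, beta, xs):
    """Solve for alpha in g(x) = sum_k alpha_k exp(-beta * (x - v_k)^2)
    by least squares on the sample grid xs (centers and beta held fixed)."""
    Phi = design(xs, centers, beta)
    alpha, *_ = np.linalg.lstsq(Phi, f(np.asarray(xs, dtype=float)), rcond=None)
    return alpha

# a target that vanishes at infinity, as condition (3) requires
f = lambda x: (1.0 + x) * np.exp(-0.5 * x ** 2)
xs = np.linspace(-6.0, 6.0, 241)
centers = np.linspace(-4.0, 4.0, 25)
alpha = fit_alphas(f, centers, beta=1.0, xs=xs)
max_err = np.max(np.abs(design(xs, centers, 1.0) @ alpha - f(xs)))
```

With 25 centers on [−4, 4], the uniform error on the sample grid is already small, illustrating how rapidly a sum of the form (4) can approach a function satisfying (3).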
2.3 Classification
Our main result concerning classification, which follows, is an application of Theorem 1.

Theorem 3: Assume that X is locally compact but not compact, that for each γ > 0 there is a compact subset Xγ of X such that ‖x‖ ≥ γ for x ∈ X with x ∉ Xγ, and that every bounded subset of X is bounded in S.⁵ Let C1, ..., Cm be pairwise disjoint compact subsets of X, and let C0 = {x ∈ X: ‖x‖ ≥ ξ} where ξ > max_j sup{‖x‖ : x ∈ Cj}. Let a0, a1, ..., am be distinct real numbers with a0 = 0 and, with ρ = min_{i≠j} |ai − aj|, let Q: ∪j (aj − 0.25ρ, aj + 0.25ρ) → IR be specified by Q(a) = aj for a ∈ (aj − 0.25ρ, aj + 0.25ρ) and each j. Then there are a positive integer q, real numbers α1, ..., αq, numbers β1, ..., βq belonging to P, and elements v1, ..., vq of V such that

Q( Σ_{k=1}^{q} αk exp{−βk ‖x − vk‖²} ) = aj

⁵ A subset Y of X is bounded if sup{d(x, z): x ∈ Y} < ∞ for some z ∈ X, and Y is unbounded if it is not bounded.
for x ∈ Cj and j = 0, 1, ..., m.

The interpretation of Theorem 3 is that the family of classification problems addressed by the theorem can be solved using a structure consisting of only a generalized RBF network followed by a quantizer. This family of classification problems is more general than the one studied in [5], [6] in that here we consider also the problem of classifying signals x whose critical property is that the norm of x is sufficiently large. This additional degree of generality is interesting from the viewpoint of understanding the capabilities of the structure considered, but it may not be of much significance from a practical viewpoint. A result similar to Theorem 3, in which the additional class of signals is not considered and whose proof is along the same lines but is more direct, is given in Section 2.4.

Examples: There are two particularly important examples of applications of Theorem 3. We may take S = X = IR^n, in which case the classifier described in the theorem classifies elements of IR^n. Alternatively, we may take S to be the set of continuous real-valued functions on the n-dimensional interval [0, 1]^n, with the inner product given by

⟨x, y⟩ = ∫_{[0,1]^n} x(w) y(w) dw.
In this case X can be selected to be any unbounded subset of S consisting of equicontinuous functions, with d the uniform (i.e., max) metric, and C1, ..., Cm can be chosen to be any family of pairwise disjoint subsets of X that are closed and bounded. In this case the classifier classifies functions (e.g., images when n = 2 or 3). In particular, we can take the set of points of X to be any unbounded subset of S consisting of all Lipschitz continuous functions with a fixed Lipschitz constant. Other examples can be given in which functions that are not necessarily continuous can be classified.

2.4 Related Results
Theorem 4 below is a version of Theorem 3 whose proof is along the same lines but is more direct. Theorem 4 is included because its proof, which is omitted in this version of the paper, provides additional understanding of the origin of its conclusion – that classification of the elements of pairwise disjoint compact sets can be achieved as indicated.

Theorem 4: Let C1, ..., Cm be pairwise disjoint compact subsets of X. Let a1, ..., am be distinct real numbers and, with ρ = min_{i≠j} |ai − aj|, let Q: ∪j (aj − 0.25ρ, aj + 0.25ρ) → IR be specified by Q(a) = aj for a ∈ (aj − 0.25ρ, aj + 0.25ρ) and each j. Then there are a positive integer q, real numbers α1, ..., αq, numbers
β1, ..., βq belonging to P, and elements v1, ..., vq of V such that

Q( Σ_{k=1}^{q} αk exp{−βk ‖x − vk‖²} ) = aj
for x ∈ Cj and j = 1, ..., m.

Concluding Comment: While natural questions arise concerning the results in this paper and specific practical problems and detailed implementations, such questions are not addressed here. We make no apologies for not having considered these questions. Our interest in this paper is in questions concerning what is possible. Answers to such questions are often of considerable value.
References

1. J. Park and I. W. Sandberg, "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, vol. 3, no. 2, pp. 246–257, 1991.
2. J. Park and I. W. Sandberg, "Approximation and Radial-Basis-Function Networks," Neural Computation, vol. 5, no. 2, pp. 305–316, March 1993.
3. E. J. Hartman, J. D. Keeler, and J. M. Kowalski, "Layered Neural Networks with Gaussian Hidden Units as Universal Approximators," Neural Computation, vol. 2, no. 2, pp. 210–215, 1990.
4. M. H. Stone, "A Generalized Weierstrass Approximation Theorem," in Studies in Modern Analysis, ed. R. C. Buck, vol. 1 of MAA Studies in Mathematics, pp. 30–87, Englewood Cliffs, NJ: Prentice-Hall, March 1962.
5. I. W. Sandberg, "General Structures for Classification," IEEE Transactions on Circuits and Systems I, vol. 41, no. 5, pp. 372–376, May 1994.
6. I. W. Sandberg, J. T. Lo, C. Francourt, J. Principe, S. Katagiri, and S. Haykin, Nonlinear Dynamical Systems: Feedforward Neural Network Perspectives, New York: John Wiley, 2001.
7. I. W. Sandberg, "Structure theorems for nonlinear systems," Multidimensional Systems and Signal Processing, vol. 2, no. 3, pp. 267–286, 1991. (See also the Errata in vol. 3, no. 1, p. 101, 1992.) A conference version of the paper appears in Integral Methods in Science and Engineering-90 (Proceedings of the International Conference on Integral Methods in Science and Engineering, Arlington, Texas, May 15–18, 1990, ed. A. H. Haji-Sheikh), New York: Hemisphere Publishing, pp. 92–110, 1991.
8. R. A. DeVore and G. G. Lorentz, Constructive Approximation, New York: Springer-Verlag, 1993.
9. W. A. Sutherland, Introduction to Metric and Topological Spaces, Oxford: Clarendon Press, 1975.
Mixture of Probabilistic Factor Analysis Model and Its Applications

Masahiro Tanaka

Department of Information Science and Systems Engineering,
Faculty of Science and Engineering, Konan University,
8-9-1 Okamoto, Higashinada, 658-8501 Kobe, Japan
[email protected]
http://lotus.mis.konan-u.ac.jp/~tanaka/
Abstract. In this paper, regression analysis is treated for the case in which the output estimate may take more than one value. This is an extension of the usual regression analysis; such cases may arise when the output is affected by some unknown input. The stochastic model used in this paper is the mixture of probabilistic factor analysis models, whose identification scheme has already been developed by Tipping and Bishop. We show the usefulness of our method by a numerical example.
1 Introduction

In many cases in regression analysis, the estimator must be capable of expressing a nonlinear relation. Multilayer neural networks [3] are often used for this purpose. However, they are not necessarily useful in all cases. For example, if the output is very noisy, a deterministic output does not mean much. Another case in which a neural network is of little use is when the output may take several separate values stochastically. Since the probability density function (PDF) describes the underlying distribution of the data, it can be used in various analyses, including regression analysis. Moreover, it is often more powerful than multilayer neural networks in the problems mentioned above. A Gaussian mixture model may be used for a wide class of non-Gaussian models. However, observing the data locally, it is often the case that the data exist only in a subspace, so numerical problems may arise. In such cases a lower-dimensional model should be used. Thus probabilistic factor analysis, which is often called the "principal component analysis model" [5], is a good candidate.

This paper consists of the following sections. In Section 2, the relation between the PDF and regression analysis is explained. In Section 3, the mixture of probabilistic factor analysis (PFA) models is introduced, where the data may concentrate locally on subspaces. In Section 4, the regression analysis corresponding to the mixture of PFA models is shown. In Section 5, a result of the proposed regression analysis is demonstrated. Section 6 is the conclusion.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 183–188, 2001.
© Springer-Verlag Berlin Heidelberg 2001
2 Probability Density Function and Regression Analysis
Regression analysis is a statistical method to estimate a variable z based on the vector of observed explanatory variables p, i.e., the problem is to give the function

ẑ = f(p)    (1)

where the function f(·) is linear or nonlinear. Fig. 1 shows an example of our motivation. If the contour of the joint distribution of the output and input is like this, the estimate of the output should be the bold line, not the dashed line that lies where the density is very low. However, the standard minimum variance estimate (e.g., [1])

ẑ(p) = E[z|p] = ∫ z p(z|p) dz    (2)

yields the output expressed by the dashed line.
Fig. 1. PDF and regression analysis (input p on the horizontal axis, output z on the vertical; the labels mark the desirable estimate and the wrong decision).
So, when such a distribution may arise, it is obviously important to estimate the bold line. The problem of this paper is to give it by using the conditional PDF p(z|p), or equivalently, the joint PDF p(z, p), based on the training data set {(z(k), p(k)); k = 1, ..., N}, because

p(z|p) = p(z, p) / p(p)    (3)

3 Mixture of Probabilistic Factor Analysis Model
Suppose the data is expressed by one of the models

x = W^(i) y^(i) + µ^(i) + e^(i)  with probability a^(i)    (4)
where x = [z p^T]^T, W^(i) is the matrix of factor loadings, x is the n-dimensional observation data, and y^(i) is an m-dimensional (m < n) latent variable. The latent variable is a stochastic variable, and is assumed to obey the normal distribution

y^(i) ~ N(0, I)    (5)

and

e^(i) ~ N(0, (σ^(i))² I)    (6)

Let us return to the model (4). Using this model, the PDF of x can be written as

p(x) = Σ_i a^(i) (2π)^{−n/2} |C^(i)|^{−1/2} exp{ −(1/2) (x − µ^(i))^T (C^(i))^{−1} (x − µ^(i)) }    (7)

where

C^(i) = W^(i) (W^(i))^T + (σ^(i))² I    (8)
The model description and the identification algorithm for this mixture of PPCA models were given by Tipping and Bishop [5]. However, the dimension of each kernel (submodel) was assumed to be known a priori.

First we note that the identification procedure of the PFA model is derived by using the EM algorithm [5]. The EM algorithm, developed by Dempster et al. [2], is an algorithm for ML estimation of the parameters when some missing data exist in the model. The missing data is a stochastic variable. The EM algorithm consists of two steps, which are iterated until the estimate converges. The E-step takes the expectation of the complete likelihood, based on the observation and the model parameters obtained so far. The M-step updates the parameters so that the expectation of the complete likelihood is maximized.

As we mentioned already, it is necessary to determine the dimensions of the latent variables. We propose to use AIC (Akaike Information Criterion), which is given by

AIC = −2L + 2P    (9)

where L is the log-likelihood and P is the number of free parameters. The problem is defined as minimizing this criterion. If the number of kernels is small and the observation dimension is also small, it is possible to calculate it by an exhaustive search. However, as these numbers grow, the search space expands exponentially, hence some heuristic method may be necessary. Metaheuristics such as the genetic algorithm may also be useful in some cases. Preliminary experiments have been done in [4].
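The AIC-based choice of latent dimension can be sketched as an exhaustive search. The parameter count below is a simple illustrative one (loadings, mean, and noise variance for a single kernel), not necessarily the paper's exact count, and `loglik_of_dim` stands in for the log-likelihood that EM would return:

```python
def aic(log_likelihood, n_params):
    """Eq. (9): AIC = -2L + 2P."""
    return -2.0 * log_likelihood + 2.0 * n_params

def ppca_param_count(n, m):
    """Illustrative free-parameter count for one kernel with observation
    dimension n and latent dimension m: W (n*m), mean (n), variance (1)."""
    return n * m + n + 1

def best_latent_dim(loglik_of_dim, n, max_m):
    """Exhaustive search over latent dimensions m = 1..max_m, as the text
    suggests for small problems; loglik_of_dim(m) would come from EM."""
    scores = {m: aic(loglik_of_dim(m), ppca_param_count(n, m))
              for m in range(1, max_m + 1)}
    return min(scores, key=scores.get)
```

With a fit that shows diminishing returns as m grows, the 2P penalty makes AIC pick an intermediate dimension rather than the largest one.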
4 Regression Analysis Based on Mixture of PFA

The estimate of z based on the observation p is given by

ẑ = E[z|p] = ∫ z p(z|p) dz = Σ_i E[z|p, i] P(i|p)    (10)

where the conditional expectation for kernel i given the input is

E[z|p, i] = Wz^(i) (Wp^(i))^T [ Wp^(i) (Wp^(i))^T + (σ^(i))² I ]^{−1} (p − µp^(i)) + µz^(i)    (11)

the a posteriori probability of the event occurrence from kernel i is

P(i|p) = p^(i)(p) a^(i) / Σ_i p^(i)(p) a^(i)    (12)

where, further, the density function of the input p for kernel i is

p^(i)(p) = (2π)^{−n/2} |Cp^(i)|^{−1/2} exp{ −(1/2) (p − µp^(i))^T (Cp^(i))^{−1} (p − µp^(i)) }    (13)

with

Cp^(i) = Wp^(i) (Wp^(i))^T + (σ^(i))² I    (14)

and P(i) is the a priori probability of kernel i.
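Equations (10)–(14) translate directly into code. The kernel layout below (a dict with keys a, mu_p, mu_z, Wp, Wz, sigma, where Wp and Wz are the p- and z-rows of W^(i)) is our own illustrative packaging:

```python
import numpy as np

def input_density(p, mu_p, Cp):
    """Eq. (13): Gaussian density of the input p under kernel i."""
    d = p - mu_p
    k = p.size
    return ((2 * np.pi) ** (-k / 2) * np.linalg.det(Cp) ** (-0.5)
            * np.exp(-0.5 * d @ np.linalg.solve(Cp, d)))

def mixture_regression(p, kernels):
    """zhat = sum_i E[z | p, i] P(i | p), eqs. (10)-(12)."""
    weights, cond_means = [], []
    for k in kernels:
        Wp, Wz, s = k["Wp"], k["Wz"], k["sigma"]
        Cp = Wp @ Wp.T + s ** 2 * np.eye(p.size)          # eq. (14)
        weights.append(k["a"] * input_density(p, k["mu_p"], Cp))
        gain = Wz @ Wp.T @ np.linalg.inv(Cp)              # gain matrix of eq. (11)
        cond_means.append(k["mu_z"] + gain @ (p - k["mu_p"]))
    weights = np.array(weights) / sum(weights)            # eq. (12)
    return sum(w * m for w, m in zip(weights, cond_means))
```

For a single scalar kernel with unit loadings and noise σ, the estimate reduces to p/(1 + σ²), the familiar shrinkage of a noisy linear-Gaussian model.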
and P (i) is the a priori probability of the kernel i. In certain cases, it is better not to mix the estimate of all the kernels. As was mentioned in the section 1, we may sometimes encounter a case where the output takes separate groups and taking the average of the estimates may severely degrade the output estimate. Such a case can happen when an important attribute is not included in the model. The output looks like taking quite distinct values stochastically. In such cases, it is obviously better to propose the estimate as a set. To do this, it is necessary to group the kernels, and within the group the outputs are mixed. This could be done by checking the sum of the a posteriori probabilities. If it comes to extremely low values, we can judge that the output is disconnected there. Fig. 2 shows an idea to do this. The horizontal axis is the output value, and the vertical axis is the joint PDF of the input and output. Note that the input is ﬁxed to the value of our interest. By scanning the joint PDF along the output value, we may ﬁnd the point where the joint PDF takes a very low value. Then we take this as a disjoint point, and the mixture of the output is done in the right part and the left part, separately. Although this processing is time consuming, it is not so troublesome because we only have to compute it for a small set of x of our interest. p(z,p)
Fig. 2. Boundaries (the joint PDF p(z, p) plotted along z for fixed p; a low threshold L separates group 1 from group 2 at the boundary points z1 and z2).
The following is the algorithm we propose.
Step 1. Fix p.
Step 2. Scan the joint PDF p^(i)(z, p) along z over a certain interval [z1, z2] with a predetermined increment value h.
Step 3. Find the boundary points of z corresponding to the threshold, as shown in Fig. 2.
Step 4. Partition the kernels into groups by using the information on which interval the ridge of the kernel density falls for the same x.
Step 5. Suppose there arose m groups of kernels. Then we estimate the outputs by equations (10)–(12), where the probability is normalized within each group.
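Steps 2–3 amount to a one-dimensional scan of the density. A minimal sketch (the grid and threshold are our own illustrative choices):

```python
import numpy as np

def density_groups(z_grid, pdf_vals, threshold):
    """Scan the joint PDF along z (input p held fixed) and return the
    intervals on which it stays above `threshold`; the gaps between
    the intervals are the disjoint points where estimates must not be mixed."""
    groups, start, prev = [], None, None
    for z, v in zip(z_grid, pdf_vals):
        if v >= threshold and start is None:
            start = z                      # entering a high-density region
        elif v < threshold and start is not None:
            groups.append((start, prev))   # leaving it: close the interval
            start = None
        prev = z
    if start is not None:
        groups.append((start, prev))
    return groups
```

Scanning a bimodal density with peaks at z = ±2, for instance, returns two intervals, one around each peak, matching the two groups of Fig. 2.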
5 Numerical Example

We generated data based on

x = W^(i) y + µ^(i) + e    (15)

for i = 1, ..., 5, where W^(i), µ^(i) are all distinct for different i and all the variables are 2-dimensional. In this experiment, we assume that all the parameters are already known, and our problem is to get the estimate of the output z based on the input p. Fig. 3 shows the estimate of the output z given the input p.
Fig. 3. Observation data and the estimates of the outputs (larger threshold).

The small dots denote the observation outputs. The symbol ◦ denotes the output estimate when the probability P(i|p) is more than 0.8; a second symbol denotes the estimates with 0.2 ≤ P(i|p) < 0.8, where i denotes the group number of the distribution subset. This shows that our estimation scheme yields an appropriate result for the estimate of multiple outputs with probability. Fig. 4 also shows the estimate of the output for a smaller threshold. For a smaller threshold, the distributions are well separated, and we obtain more estimates than in the case of a larger threshold.
Fig. 4. Observation data and the estimates of the outputs (smaller threshold).
6 Conclusions

In this paper, a regression scheme has been presented for the mixture of PFA models. This model can treat a wide class of data distributions where the output may take multiple distinct candidates. Such a case can happen when an important explanatory variable is not included in the model. In the numerical example, the model was assumed to be known and was used in the estimation of the output. In future research, both the identification and the estimation of the output should be done simultaneously.
References

1. Anderson, B.D.O., Moore, J.B.: Optimal Filtering. Prentice-Hall (1979)
2. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39(1) (1977) 1–38
3. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In Rumelhart, D.E. et al.: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. The MIT Press (1986) 321–362
4. Tanaka, M.: Modeling of mixtures of principal component analysis model with genetic algorithm. Proc. 31st ISCIE International Symposium on Stochastic Systems Theory and Its Applications (2000) 157–162
5. Tipping, M.E., Bishop, C.M.: Mixtures of Probabilistic Principal Component Analysers. Technical Report NCRG/97/003, Aston University, UK (1998) 29 pages
Deferring the Learning for Better Generalization in Radial Basis Neural Networks

José María Valls, Pedro Isasi, and Inés María Galván

Carlos III University of Madrid, Computer Science Department,
Avda. Universidad 30, 28911 Leganés, Madrid
[email protected]

Abstract. The level of generalization of neural networks is heavily dependent on the quality of the training data; some of the training patterns can be redundant or irrelevant. It has been shown that with careful dynamic selection of training patterns, better generalization performance may be obtained. Nevertheless, generalization is usually carried out independently of the novel patterns to be approximated. In this paper, we present a learning method that automatically selects the training patterns most appropriate to the new sample to be predicted. The proposed method has been applied to Radial Basis Neural Networks, whose generalization capability is usually very poor. The learning strategy slows down the response of the network in the generalization phase. However, this does not introduce a significant limitation in the application of the method, because of the fast training of Radial Basis Neural Networks.
1 Introduction

Radial basis neural networks (RBNN) [1,2] originate from the use of radial basis functions, such as Gaussian functions, in the solution of the real multivariate interpolation problem. RBNNs can be used for a wide range of applications, primarily because they can approximate any regular function [3]. Generally, the generalization capability of an RBNN is poor because it becomes too specialized in the training data. Some authors have paid attention to the nature and size of the training set in order to improve the generalization ability of the networks. There is no guarantee that generalization performance is improved by increasing the training set size [4]. It has been shown that with careful dynamic selection of training patterns, better generalization performance may be obtained [5,6]. The idea of selecting the patterns to train the network from the available data about the domain is close to our approach. However, the aim in this work is to develop learning mechanisms such that the selection of patterns used in the training phase is based on novel samples, instead of on other training patterns. Thus, the network will use its current knowledge of the new sample to have some deterministic control over what patterns should be used for training.

In this work a selective training strategy has been developed to improve the generalization capabilities of RBNN, inspired by lazy strategies [7,8]. The learning

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 189–195, 2001.
© Springer-Verlag Berlin Heidelberg 2001
method proposed involves finding relevant data to answer a particular novel pattern, deferring the decision of how to generalize beyond the training data until each new sample is encountered. Thus, the decision about how to generalize is carried out when a test pattern needs to be answered, by constructing local approximations. The main idea is to recognize, from the whole training data set, the most similar patterns to each new pattern to be processed.
2 Automatic Selection of Training Data

The method proposed in this work to train RBNN consists of selecting, from the whole training data, an appropriate subset of patterns in order to improve the answer of the network for a novel test pattern. The general idea for the selection of patterns is to include, possibly several times, only those patterns close to the novel sample. To develop this idea, let us consider q an arbitrary test pattern described by an n-dimensional vector, q = (q1, ..., qn), where qi represents the attributes of the instance q. Let X = {(xi, yi) / i = 1, ..., N} be the whole available training data set, where xi are the input patterns and yi their respective target outputs. The steps to select the training set associated to the pattern q, which is named Xq, are the following:

Step 1. A real value, dk, is associated to each training pattern (xk, yk). That value is the standard Euclidean distance between xk and the novel pattern q.

Step 2. A measure of frequency, fk = 1/dk, where k = 1, ..., N, is associated to each training pattern (xk, yk). In order to obtain a relative frequency, the values fk are normalized in such a way that the sum of the frequencies is equal to the number of training patterns in X. The relative frequencies, named fnk, are obtained by

fnk = fk / S,  where S = (1/N) Σ_{k=1}^{N} fk

Thus:

Σ_{k=1}^{N} fnk = N

Step 3. The values fnk previously calculated will be used to indicate how many times the training pattern (xk, yk) is repeated in the new training subset. Hence, they are transformed to natural numbers as nk = int(fnk). At this point, each training pattern in X has an associated natural number, nk, which indicates how many times the pattern (xk, yk) is used to train the RBNN when the new instance q is reached.

Step 4. A new training pattern subset associated to the test pattern q, Xq, is built up.

Once the training patterns are selected, the RBNN is trained with the new subset of patterns, Xq. Training a RBNN involves determining the centers, the dilations or widths, and the weights. The centers are calculated in an unsupervised way using the K-means algorithm to classify the input space. After that, the dilation coefficients are calculated as the square root of the product of the distances from the respective center to its two nearest neighbours. Finally, the weights of the RBNN are estimated in a supervised way to minimize the mean square error measured on the training subset Xq.
Let Wq ⊂ Xq be the set resulting from removing the repeated patterns. In this work, the K-means algorithm has been modified in order to avoid the situation where many classes have no patterns at all. Thus, the initial values of the centers are set as follows:

• Mq, the centroid of the set Wq, is evaluated.
• k centers are randomly generated (c1q, c2q, ..., ckq), such that ‖cjq − Mq‖ < ε, j = 1, ..., k, where ε is a very small real number.
3 Experimental Results

The deferred learning method proposed in this work has been applied to two different approximation problems. In order to validate the proposed method, RBNNs have also been trained as usual, that is, using the whole training data set.

3.1 Hermite Polynomial Approximation

The Hermite polynomial is given by the equation:

F(x) = 1.1 (1 − x + 2x²) exp(−x²/2)

A random sampling of the interval [−4, 4] is used to obtain 160 input-output points for the training set and 65 input-output data for the test set. RBNNs with different numbers of neurones have been trained using the whole training data. The training process is stopped either when 150 cycles are performed or when the derivative of the training error equals zero. The generalization capability of the trained networks has been measured using the test set and the mean square errors achieved by the networks. In Figure 1 (a), the mean errors obtained for different architectures are shown. The best results have been achieved using a RBNN with 25 neurones, although no significant differences have been found for networks between 10 and 25 neurones. It is observed that the test error cannot be improved even if more learning cycles are performed using the whole training data set.
Fig. 1. Mean error on the test set for the Hermite function achieved by different architectures: (a) traditional learning; (b) selective learning.
The selective learning method proposed in this work has also been used to train RBNNs with different architectures during 150 learning cycles, and their generalization capability has been tested. Mean square errors on the test set achieved by these
networks are shown in Figure 1 (b). In this case, only 3 neurones are necessary to obtain the best results, and using more than 8 increases the error to the level of the RBNN trained with the classical method. The value of the mean square test error achieved by the best network, the one with 25 neurones, trained with the whole training set, is shown in Table 1. Using the specific learning method proposed in this work, the best network, the one with 3 neurones, has also been evaluated. In that case, the mean square error over the test set is 2.5 times lower.

Table 1. Performance of different training methods for the Hermite function
Method                              Mean square error   Number of neurones
Training with the whole data set    2.49×10⁻⁴           25
Training with a selection of data   0.649×10⁻⁴          3
To show the difference between both learning methods, the errors for each test pattern are represented in Figure 2. In this figure it can be observed that the generalization error corresponds initially to some difficult regions of the function. It is especially in these regions where our approach is able to drastically reduce the error and to find a good approximation.
Fig. 2. Hermite function: square errors for each test pattern (selective vs. traditional learning).
3.2 Piecewise-Defined Function Approximation

This function has been chosen because of the poor generalization performance that RBNN present when approximating it. The function is given by the equation:

f(x) = −2.186x − 12.864                             if −10 ≤ x < −2
f(x) = 4.246x                                       if −2 ≤ x < 0
f(x) = 10 exp(−0.05x − 0.5) sin[(0.03x + 0.7)x]     if 0 ≤ x ≤ 10

The original training set is composed of 120 input-output points randomly generated by a uniform distribution in the interval [−10, 10]. The test set is composed of 80 input-output points generated in the same way as the points in the training set. As in the previous experiment, RBNNs with different numbers of neurones have been trained, using the whole training data until the convergence of the network is reached, that is, either when 150 cycles are performed or when the derivative of the training error equals zero. In Figure 3 (a), the mean square errors obtained for different architectures are shown. The best results have been achieved using a RBNN with 20 neurones, although no significant differences have been found for networks between 2 and 30 neurones.

The selective learning method proposed in this work has also been used to train RBNNs with different architectures during 150 learning cycles, and their generalization capability has been tested. Mean square errors on the test set achieved by these networks are shown in Figure 3 (b). In this case, only 4 neurones are necessary to obtain the best results, and using more than 8 increases the error to the level of the RBNN trained with the classical method. The mean square errors obtained in both cases over the test set, and the learning cycles involved, are shown in Table 2. As can be observed in Table 2, the mean square error over the test set is significantly reduced when an appropriate selection of patterns is made.
Fig. 3. Mean error on the test set for the piecewise function achieved by different architectures: (a) traditional learning; (b) selective learning.

Table 2. Performance of different training methods for the piecewise function.
Method                              Mean square error   Number of neurones
Training with the whole data set    7.8×10⁻⁴            20
Training with a selection of data   2.07×10⁻⁴           4
As in the previous experiment, the computational cost is higher when the deferred training method is used; on the other hand, the number of neurones is smaller, and the RBNN is trained in a shorter time. In this case, as in the previous approximation problem, the RBNNs have been trained until they reach convergence. Thus, the generalization capability of the network using the whole training data cannot be improved by training for more learning cycles. It has been observed that the network trained with the whole training data has difficulties approximating the points at which the function changes its tendency. This can be described as a deficiency in the generalization capabilities of the network. However, generalization is improved when an appropriate selection of patterns is made. The proposed method is able to provide better approximations even of points where the tendency of the function changes. Figure 4 shows the errors committed by the different learning strategies for each test pattern. Most of the test patterns are better approximated when the specific learning method is used to train the RBNN.
José María Valls, Pedro Isasi, and Inés María Galván
[Figure 4: square error vs. test pattern index (1–77), with separate curves for selective and traditional learning.]
Fig. 4. Piecewise-defined function: square errors for each test pattern.
4 Conclusions
The results presented in the previous section show that if RBNNs are trained with a selection of training patterns, the generalisation performance of the network is improved. The selection of the most relevant training patterns helps to obtain RBNNs able to better approximate complex functions. The selective method seems to be more sensitive to the number of hidden neurones than the traditional one. However, when the selective method is used, fewer experiments need to be carried out, because it achieves better generalisation results with fewer hidden neurones.
Due to the reduced input space, the choice of the initial centres for the K-means algorithm is extremely important. Previous experiments have shown that if the K-means algorithm is used as usual, many clusters remain empty, and many hidden neurones in the RBNN are useless, harming the network's behaviour. The proposed method of determining the initial centroid of each cluster avoids this problem.
The selective learning method proposed in this work involves storing the training data in memory and finding the relevant data to answer a particular test pattern. Thus, the decision about how to generalise is made when a test pattern needs to be answered, by constructing local approximations. This implies a large computational cost, because the network has to be trained whenever a new test sample is presented. However, this is not a disadvantage of the method: in many cases that computational effort can be afforded, and achieving lower approximation errors is an important advantage. Moreover, the number of neurones of the network trained with the selective method is much lower; thus, the computational effort is not as high as it appears to be.
The proposed method uses the Euclidean distance as similarity measure. However, the method is flexible enough to incorporate other similarity measures, which will be studied in the future.
References
1. Moody, J.E., Darken, C.J.: Fast Learning in Networks of Locally-Tuned Processing Units. Neural Computation 1 (1989) 281–294.
Deferring the Learning for Better Generalization in Radial Basis Neural Networks
2. Poggio, T., Girosi, F.: Networks for Approximation and Learning. Proceedings of the IEEE 78 (1990) 1481–1497.
3. Park, J., Sandberg, I.W.: Approximation and Radial-Basis-Function Networks. Neural Computation 5 (1993) 305–316.
4. Abu-Mostafa, Y.S.: The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning. Neural Computation 1 (1989) 312–317.
5. Cohn, D., Atlas, L., Ladner, R.: Improving Generalisation with Active Learning. Machine Learning 15 (1994) 201–221.
6. Vijayakumar, S., Ogawa, H.: Improving Generalization Ability through Active Learning. IEICE Transactions on Information and Systems E82-D(2) (1999) 480–487.
7. Atkeson, C.G., Moore, A.W., Schaal, S.: Locally Weighted Learning. Artificial Intelligence Review 11 (1997) 11–73.
8. Wettschereck, D., Aha, D.W., Mohri, T.: A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review 11 (1997) 273–314.
Improvement of Cluster Detection and Labeling Neural Network by Introducing Elliptical Basis Function
Christophe Lurette1 and Stéphane Lecoeuche1,2
1 Laboratoire I3D, USTL, Bâtiment P2, 59655 Villeneuve d'Ascq, France
Phone: +33 320 434 876, Fax: +33 320 436 567
[email protected]
2 E.I.P.C., Campus de la Malassise, BP 39, 62967 Longuenesse Cedex, France
Phone: +33 321 388 510, Fax: +33 321 388 505
[email protected]
Abstract. This paper proposes an improvement of the Cluster Detection and Labeling Neural Network. The original classifier criterion has been modified by introducing Elliptical Basis Functions (EBF) as the transfer function of the hidden neurons. In the original CDL network, a similarity criterion is used to determine membership to prototypes and then to classes. By introducing EBFs, we have introduced degrees of membership, leading to elliptically shaped classes. In this paper, the functioning of the original CDL network is first summarized. Then, the improvements in terms of network architecture, neuron activation function, and learning stages are described. We present the improvement obtained with EBFs and the modification of the network's auto-adaptation abilities. To validate our architecture, we illustrate its benefits in comparison with the original CDL network.
1. Introduction
Neural networks are widely used for the classification of natural data [1][2]. In many applications the data are pre-collected and preprocessed in order to be classified by a supervised classifier, and neural networks have shown their ability to achieve this work easily. Many improvements have been developed in order to label the data-space representation in very precise partitions by describing connected shapes [3], radial shapes [4], elliptic shapes [5], etc. In some applications, like the classification of time-series data, some future cases might not be predictable, so new partitions of the space have to be created. For labeling these new kinds of data, unsupervised and auto-adaptation abilities have been introduced into classification techniques [6][7]. Many neural networks have been developed with unsupervised abilities or with auto-adaptation abilities, but few of them, like the Cluster Detection and Labeling network [8], have been developed with both. For this kind of network, complex class shapes are defined by creating many prototypes for each class, each prototype describing a small round cluster. We present an adaptation of the CDL network that improves its architecture and its ability to label any irregular cluster shape with fewer prototypes.
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 196–202, 2001.
© Springer-Verlag Berlin Heidelberg 2001
2. The Cluster Detection and Labeling (CDL)
2.1. CDL Structure and Principle
The CDL network is a feedforward network, developed in 1998 by Eltoft and DeFigueiredo [8]. It consists of four layers, as shown in Figure 1.
Fig. 1. CDL structure.
Fig. 2. CDL principle.
The input layer consists of as many neurons as there are components of the input vectors. The hidden layer is fully connected to the input layer; each of its neurons represents a prototype of a class, and the number of neurons is variable. A prototype groups together a set of close input vectors. The output layer consists of as many neurons as there are detected classes. The annex layer makes possible the determination of the similarity thresholds between an input vector and each prototype defined by the neurons of the hidden layer. The connection weights between the hidden layer and the output layer characterize the relation between the prototypes and their class. The learning and the adaptation of the network are made in three stages, as described in Figure 2.

First Stage
The first and main stage is called "classification without fusion". The particularity of this stage is to make possible the creation of a new prototype when the similarity between the presented input vector and the known prototypes is very low. Similarly, a new class is created when a newly created prototype is very different from the already known prototypes. The main criterion for the creation of new prototypes or new classes is based on a distance between a new example X_i and the known prototypes P_j. In the original architecture, the similarity criterion (1) is defined as the inverse squared Euclidean distance:

    s(P_j, X_i) = 1 / D²(P_j, X_i)                                    (1)
The similarity criterion between the example and each prototype P_j is compared to two similarity thresholds t_min^j and t_max^j, the outputs of the j-th neuron of the annex layer. These thresholds are computed by the annex layer using ξ_min and ξ_max, which are defined at each iteration, as described in Figure 2. For example, when a new sample is presented, the annex layer calculates for each prototype the two corresponding thresholds t_min^j and t_max^j. Then, the hidden layer makes
the comparison with the similarity criterion. So, different cases are possible; they are summarized in Table 1.

Table 1. Different cases for the first stage.

1st case
  If:   the similarity criterion between the example and any known prototype is smaller than the first threshold: s(P_j, X_i) < t_min^j for all P_j.
  Then: the example is not close to any prototype. It is necessary to create a new prototype (a new neuron on the hidden layer) and a new class (a new neuron on the output layer).

2nd case
  If:   the similarity criterion is larger than the first threshold but smaller than the second threshold for prototypes P_j that belong to the same class: t_min^j < s(P_j, X_i) < t_max^j.
  Then: the example is close to these prototypes, but not close enough to be associated with one of them. It is necessary to create a new prototype; this new prototype and the prototypes P_j belong to the same class.

3rd case
  If:   the similarity criterion is larger than both thresholds for some prototypes P_j that belong to the same class: t_min^j < t_max^j < s(P_j, X_i) for P_j.
  Then: the example is associated with the prototype P_j and with its class.

4th case
  If:   the similarity criterion is larger than the first threshold for multiple prototypes that belong to different classes.
  Then: the ambiguity will be analyzed during the next stage of the learning procedure.
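The case analysis of Table 1 can be sketched as a small decision routine. This is an illustrative reading of the table, not the authors' code: the vectorized inputs, the choice of the most similar prototype in the 3rd case, and the precedence given to the ambiguity case are assumptions.

```python
import numpy as np

def first_stage(sim, t_min, t_max, proto_class):
    """Decide what to do with a new example given its similarity to each prototype.

    sim[j]        -- similarity s(P_j, X_i) of the example to prototype j
    t_min, t_max  -- per-prototype similarity thresholds (annex-layer outputs)
    proto_class   -- class label of each prototype
    """
    sim, t_min, t_max = map(np.asarray, (sim, t_min, t_max))
    above_min = sim > t_min
    above_max = sim > t_max
    if not above_min.any():
        return "new prototype and new class"               # 1st case
    classes = {proto_class[j] for j in np.flatnonzero(above_min)}
    if len(classes) > 1:
        return "ambiguity: resolve in class-fusion stage"  # 4th case
    if above_max.any():
        j = int(np.argmax(sim))
        return f"assign to prototype {j} and its class"    # 3rd case
    return "new prototype in the existing class"           # 2nd case
```

The decision is recomputed for every presented sample, which is what lets the network grow prototypes and classes online.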
Second Stage
The "class fusion" stage regroups prototypes and classes that are close in the representation space. If an ambiguity was detected during the previous stage, as in the 4th case, the "class fusion" stage resolves it by merging the different ambiguous prototypes into the same class. The output layer is therefore modified by eliminating the neurons that defined the ambiguous classes, leaving a unique neuron, which represents the merged class.

Third Stage
The "class evaluation" stage makes possible the characterization of the different classes in order to modify the thresholds ξ_min and ξ_max that are used in the computation of the similarity criterion. For example, a threshold could be used to eliminate classes containing too few assigned examples. These examples are marked as "unclassed", the similarity thresholds are modified in an iterative manner, and the "unclassed" examples are presented to the neural network again. Further information is found in reference [8].

2.2. Advantages and Constraints of the CDL
The principal advantage of the CDL is its auto-adaptive architecture, thanks to the two thresholds used for the creation of prototypes and classes. On the other hand, the similarity computation, with the annex layer it requires, is overly complex. Furthermore, the choice of the two similarity thresholds is not easy, in that
the original structure of the CDL is extremely sensitive in the very close neighborhood of the prototypes.
3. Our Adaptation of the CDL
3.1. A New Activation Function for the Neurons of the Hidden Layer
Our first adaptation has consisted in modifying the activation function of the neurons of the hidden layer. We have introduced a hyper-elliptical activation function, so that the output of the hidden layer defines a membership degree (2) to the prototypes, in place of the similarity criterion of the CDL:

    μ(P_j, X_i) = exp( − d(P_j, X_i)² / (2 α_{P_j}) )                  (2)
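As a concrete reading of Eq. (2) with the Mahalanobis distance chosen below, a minimal sketch (function names are illustrative; in the network, the mean, covariance, and α would be the learned prototype parameters from Section 3.2):

```python
import numpy as np

def membership(x, mean, cov, alpha):
    """Membership degree of sample x to a prototype with hyper-elliptical shape.

    Uses the squared Mahalanobis distance induced by the prototype's mean
    vector and covariance matrix, then Eq. (2): mu = exp(-d^2 / (2 * alpha)).
    """
    diff = x - mean
    d2 = diff @ np.linalg.inv(cov) @ diff   # squared Mahalanobis distance
    return np.exp(-d2 / (2.0 * alpha))
```

Unlike the inverse squared Euclidean distance, this degree is bounded in (0, 1] and varies smoothly near the prototype centre, which is the reduced sensitivity mentioned in the text.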
where α_{P_j} is a normalization factor defined in Section 3.2. The choice of this new activation function has two advantages. First, it is less sensitive than the inverse squared Euclidean distance in the close neighborhood of the prototypes. Moreover, by introducing an estimation of the membership probability for each prototype, we can define a membership degree for each class. Lastly, for the distance, we have chosen the Mahalanobis distance, which gives the prototypes a hyper-elliptical shape thanks to the mean vector M_{P_j} and the covariance matrix Σ_{P_j}.

3.2. An Adaptive and Iterative Learning of the Prototypes
In the proposed adaptation, two thresholds μ_min and μ_max for creating new prototypes or new classes are still used, but they are fixed for all iterations and defined for all prototypes; furthermore, the annex layer is no longer used and is eliminated. Moreover, with our method we are able to adapt the existing prototypes directly. When a new example X_i is associated with a known prototype P_j, that is to say μ(P_j, X_i) > μ_max, we adapt only this prototype, in the same way as [7], but with the full covariance matrix Σ_{P_j} [5]. This adaptation is performed in an iterative manner using the following relations:

    M_{P_j}^{k+1} = (k/(k+1)) M_{P_j}^{k} + (1/(k+1)) X_i                                        (3)

    Σ_{P_j}^{k+1} = ((k−1)/k) Σ_{P_j}^{k} + (1/(k+1)) (X_i − M_{P_j}^{k})ᵀ (X_i − M_{P_j}^{k})   (4)
In addition, we have defined a coefficient α_{P_j} in relation (2). An equivalent coefficient was introduced in [5] as a smoothing parameter. In our case, α_{P_j} ensures a membership degree greater than μ_max for all the samples that have been associated with the adapted prototype. This coefficient is also modified when we adapt a prototype, according to (5).
    α_{P_j}^{k+1} = max_{X_i ∈ P_j} [ (X_i − M_{P_j}^{k+1})ᵀ (X_i − M_{P_j}^{k+1}) / (−2 ln μ_max) ]   (5)
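Equations (3)-(5) can be combined into one adaptation step. The sketch below is an illustrative transcription under stated assumptions: the samples associated with the prototype are kept in memory for the max in Eq. (5), and the quadratic form in Eq. (5) is Euclidean, as written in the text; the function name and signature are hypothetical.

```python
import numpy as np

def adapt_prototype(M, S, alpha, x, k, samples, mu_max=0.6):
    """One iterative adaptation step of a prototype, following Eqs. (3)-(5).

    M, S    -- current mean vector and covariance matrix (after k samples)
    x       -- new example associated with the prototype
    samples -- all examples associated so far (needed for the max in Eq. (5))
    """
    M_new = (k / (k + 1)) * M + x / (k + 1)                      # Eq. (3)
    d = (x - M)[None, :]                                         # uses the old mean M^k
    S_new = ((k - 1) / k) * S + (d.T @ d) / (k + 1)              # Eq. (4)
    # Eq. (5): alpha guarantees membership >= mu_max for all associated samples
    diffs = samples - M_new
    alpha_new = max(float(dv @ dv) / (-2.0 * np.log(mu_max)) for dv in diffs)
    return M_new, S_new, alpha_new
```

Because the update is incremental, the prototype's position, orientation, and volume evolve online, without retraining from scratch.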
So, our first work has focused on the modification of the first stage, with the definition of the elliptical shape of the prototypes and especially their auto-adaptation. The use of an Elliptical Basis Function for the definition of the hidden layer leads to a better definition of the prototypes, as shown in Figure 3, in that a prototype is not defined by one sample of the learning data and two thresholds, but by several samples of the data set.
Fig. 3. This figure illustrates schematically the benefits of our adaptation: the number of prototypes is reduced, and a unique prototype is created.
3.3. Evolution of the Different Stages
We have described, in Sections 3.1 and 3.2, our adaptation of the CDL in terms of architecture. Here, we present the modifications of the learning principle with respect to the initial three stages.
In the first stage, "classification without fusion", the different cases defined in Table 1 are the same, except the third. When μ(P_j, X_i) > μ_max for a prototype P_j, we associate the example with the prototype and with its class, but we also adapt the prototype P_j, according to Section 3.2. However, the use of (3) and (4) requires that a minimum number of examples be already associated with the prototype. That is the reason why we have defined a threshold N_min^P on the number of examples associated with a prototype. In the same way, we have defined a value Σ_ini for the initialization of the covariance matrix.
In the second stage, "class fusion", no modification has been made.
In the third stage, "class evaluation", we have kept the original criterion and added a "prototype evaluation" stage in order to verify the threshold N_min^P on the number of examples required for the definition of a prototype. In our version, no threshold is recomputed for the next iteration.
4. Experimental Results
In order to validate the benefits of our different adaptations of the CDL, we have tested it on simulated data.
4.1 Irregular and Elongated Classes The simulated data are composed of three classes which are irregular and elongated, as shown in figure 4a), where the samples are the dots.
[Figure 4a): simulated data (dots) with the three classes and the prototypes found.]

Figure 4b) results:
                                                      CDL Classic   CDL Adapted
Number of prototypes                                       74            15
Number of classes                                           3             3
Number of iterations                                        2             5
No. of prototypes defined for fewer than 12 samples        24             1
Fig. 4. Comparison between the classic and the adapted CDL on simulated data, with some results summarized in Figure 4b). The classic CDL is initialized with ξ_min = 20, ξ_max = 40, N_min^C = 15; its prototypes are represented by circles in Figure 4a). Our adaptation of the CDL is used with μ_min = 0.3, μ_max = 0.6, Σ_ini = 0.22 I, N_min^C = 7, N_min^P = 10; it defines only 15 prototypes (cross dots).
Our adaptation of the CDL has the advantage of creating fewer prototypes than the original version, even if the number of iterations is higher. After computation, our algorithm found three classes, each of which is defined by several prototypes that have been adapted in position by M_{P_j}, in orientation by Σ_{P_j}, and in volume by α_{P_j}. Figure 5a) illustrates the prototypes found, their positioning, and their shape for the two thresholds. Notably, none is defined at the boundary of the classes, in contrast with the classic CDL. The use of elliptical basis functions for the definition of the activation function of the prototypes (hidden layer) permits the definition of a membership degree Ψ for each class (output layer):

    Ψ(C_k, X_i) = max( 1, Σ_{P_j ∈ C_k} μ(P_j, X_i) )                  (6)
So, in Figure 5b) we have represented the boundary of each class by using the definition of membership described in (6). Each boundary is defined with a threshold ψ_min or ψ_max, analogous to the thresholds μ_min and μ_max for the prototypes.
Fig. 5. a) Prototypes and their two membership thresholds, b) Definition of a boundary for each class with membership degree of the classes.
5. Conclusion
In conclusion, the introduction of Elliptical Basis Functions in the definition of the activation function has several advantages, the first of which is to simplify the architecture of the CDL by eliminating the annex layer. Moreover, we have shown that with our version we can define a boundary for each class, thanks to the shape defined by the prototypes. Our adaptation should be of interest in diagnosis, for the detection of class evolution, and this will be the subject of our future work.
References
1. Looney, C.: Pattern Recognition Using Neural Networks. Theory and Algorithms for Engineers and Scientists. Oxford University Press (1997)
2. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
3. Gupta, L., McAvoy, Phegley, J.: Classification of temporal sequences via prediction using the simple recurrent neural network. Pattern Recognition 33(10) (2000) 1759–1770
4. Gomm, J.B., Yu, D.L.: Selecting Radial Basis Function Network Centers with Recursive Orthogonal Least Squares Training. IEEE Transactions on Neural Networks 11(2) (2000) 306–314
5. Mak, M.W., Kung, S.Y.: Estimation of Elliptical Basis Function Parameters by the EM Algorithm with Application to Speaker Verification. IEEE Transactions on Neural Networks 11(4) (2000) 961–969
6. Mao, J., Jain, A.K.: A Self-Organizing Network for Hyper-Ellipsoidal Clustering (HEC). IEEE Transactions on Neural Networks 7(1) (1996) 16–29
7. Zheng, N., Zhang, Z., Shi, G., Qiao, Y.: Self-Creating and Adaptive Learning of RBF Networks: Merging Soft-Competition Clustering Algorithm with Network Growth Technique. International Joint Conference on Neural Networks (1999)
8. Eltoft, T., deFigueiredo, R.J.P.: A New Neural Network for Cluster-Detection-and-Labeling. IEEE Transactions on Neural Networks 9(5) (1998) 1021–1035
Independent Variable Group Analysis Krista Lagus, Esa Alhoniemi, and Harri Valpola Neural Networks Research Centre, Helsinki University of Technology P.O. Box 5400, FIN02015 HUT, Finland {krista.lagus,esa.alhoniemi,harri.valpola}@hut.fi Abstract. When modeling large problems with limited representational resources, it is important to be able to construct compact models of the data. Structuring the problem into subproblems that can be modeled independently is a means for achieving compactness. In this article we introduce Independent Variable Group Analysis (IVGA), a practical, efﬁcient, and general approach for obtaining sparse codes. We apply the IVGA approach for a situation where the dependences within variable groups are modeled using vector quantization. In particular, we derive a cost function needed for model optimization with VQ. Experimental results are presented to show that variables are grouped according to statistical independence, and that a more compact model ensues due to the algorithm.
1 Introduction
The goal of unsupervised learning is to extract an efficient representation of the statistical structure implicit in the observations. A good model is both accurate and simple in terms of model complexity, i.e., it forms a compact representation of the input data. Sparse coding, independent components, etc., can all be justified from the point of view of constructing compact representations. Learning amounts to searching in a model space by optimization of some cost function that measures both the accuracy of representation and—ideally at least—the model complexity.
In problems with a large number of diverse observations there are often groups of variables which have strong mutual dependences within the group but which can be considered practically independent of the variables in other groups. It can be expected that the larger the problem domain, the more independent groups there are¹. Estimating a model for each group separately produces a more compact representation than applying the model to the whole set of variables. Compact representations are computationally beneficial and, moreover, offer better generalization.

1.1 Independent Variable Group Analysis
We suggest an approach, independent variable group analysis (IVGA), where the dependences of variables within a group are modeled, whereas the dependences
¹ Consider, for instance, creating a model of the whole world...
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 203–210, 2001.
© Springer-Verlag Berlin Heidelberg 2001
between the groups are neglected. This generates a pressure towards building groups whose variables are mutually dependent but largely independent of the variables in other groups. Usually such variable grouping is performed by a domain expert prior to modeling with automatic, adaptive methods. However, we claim that it is worthwhile and feasible to obtain groupings automatically, in order to create structured, efficient large-scale models automatically.
In IVGA, each separate group can be modeled using any method, as long as a cost function that measures the quality of the representation, including both compactness and faithfulness, is derived for the model family. In particular, such a cost function can be derived within the statistical framework of generative modeling. In this approach, the best model is the one that gives the highest probability to the observations.
Furthermore, we demonstrate the approach by describing and evaluating an algorithm, IVGA_VQ, that can be used for computing IVGA when the model space used for modeling dependences between variables consists of vector quantizers (VQs). The variational EM algorithm is used for adapting the VQs and for computing the cost function.

1.2 Structure of the Article
The structure of the rest of the article is as follows: first, we describe two related models in Section 2. In Section 3 we introduce an algorithm for computing IVGA where VQs are used in modeling each variable group. Section 4 describes an experiment performed to validate the feasibility of the approach and to demonstrate the algorithm. Conclusions are presented in Section 5.
2 Related Models
In multidimensional independent component analysis (MICA) [1], and its subsequent development, independent subspace analysis (ISA) [4], the idea is to find independent linear feature subspaces that can be used to reconstruct the data efficiently. Each subspace is thus able to model the linear dependences in terms of the latent directions defining the subspace. The approach bears resemblance to IVGA, which can be seen as a nonlinear version of MICA with the additional requirement that the subspaces be spanned by subsets of the variable axes.
Factorial vector quantization (FVQ) [3,7] can also be seen as a nonlinear version of MICA. It uses several different VQs that cooperate to reconstruct the observations and is thus very similar to IVGA_VQ. The structural differences of the FVQ model compared to the IVGA_VQ model are: in FVQ, (1) each vector in each pool or group contains all the components (variables) of the original input vector, and (2) these are summed to produce the value of a single output. In contrast, in IVGA_VQ each variable group is modeled by exactly one vector quantizer (VQ). This leads to efficient computation, as each VQ can operate independently, as opposed to FVQ, where the winners are found iteratively.
[Figure 1: three schematic panels — FVQ (Hinton & Zemel, 1994), MICA/ISA (Cardoso, 1998; Hyvärinen & Hoyer, 2000), and IVGA, where any method may model the dependencies within a variable group.]
Figure 1 illustrates the model structures of MICA/ISA, FVQ, and IVGA.
3 IVGA_VQ: Algorithm for IVGA with VQs as Data Models
Any IVGA algorithm consists of two parts: (1) grouping the variables, and (2) constructing a separate model for each variable group. An independent variable grouping is obtained by comparing models with different groupings using a suitable cost function. In principle any model can be used, if the necessary cost function is derived for the model family.
As a cost function one can use the negative log-likelihood of the data given the model, namely −ln p(x|H). The total model cost L_tot needed for comparing variable groupings is the sum of the costs of the individual variable groups, L_tot = Σ_g L_g = −Σ_g ln p(x_g|H_g), where g is the index of a group of variables, and x_g and H_g are the data and the model related to that variable group, respectively.
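The comparison of groupings via L_tot can be sketched as a greedy search in which variables are moved between groups whenever a move lowers the total cost, as in the heuristic described in the text. The sketch below is a simplified stand-in (no recursive reruns or explicit group merges); `group_cost(X, g)` is a hypothetical callback returning the cost L_g of modeling the columns in g.

```python
import numpy as np

def greedy_ivga(X, group_cost, n_passes=10):
    """Greedy search for an independent variable grouping.

    X          -- data matrix, one column per variable
    group_cost -- callable returning the cost L_g of modeling a set of columns
    Starts from singleton groups and moves variables between groups (or into
    a fresh group) whenever the move lowers L_tot = sum_g L_g.
    """
    D = X.shape[1]
    groups = [{j} for j in range(D)]
    for _ in range(n_passes):
        improved = False
        for j in range(D):
            src = next(g for g in groups if j in g)
            # candidate destinations: every other group, plus a fresh group
            candidates = [g for g in groups if g is not src] + [set()]
            for dst in candidates:
                old = group_cost(X, src) + (group_cost(X, dst) if dst else 0.0)
                new_src = src - {j}
                new = (group_cost(X, new_src) if new_src else 0.0) \
                      + group_cost(X, dst | {j})
                if new < old - 1e-9:   # move j only if the total cost drops
                    src.discard(j)
                    dst.add(j)
                    if not any(g is dst for g in groups):
                        groups.append(dst)
                    improved = True
                    break
            groups = [g for g in groups if g]  # drop emptied groups
        if not improved:
            break
    return [sorted(g) for g in groups]
```

Because each candidate move only re-evaluates the two groups it touches, the search stays tractable even though the space of all groupings is combinatorial.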
In all cases where the IVGA approach is used, the same problem arises: it is computationally infeasible to try all possible groupings of D variables into G distinct groups, where G varies from 1 to D. Thus, especially when D is large, some heuristic optimization strategy has to be utilized. In our experiments, all the variables were initially assigned to groups of their own and then moved one by one from group to another if the movement reduced the value of the cost L_tot. In addition, every now and then (1) IVGA was recursively run for one group or the union of two groups (the depth of recursion was limited to one), or (2) a merge of two groups was considered.

3.1 The VQ Model and the Cost Function
To demonstrate the approach, we will describe IVGA_VQ, where each variable group is modeled using a different VQ. From now on, we will refer to the cost function of a single variable group, leaving out the group index g.
The vector quantization model used for a single variable group consists of codebook vectors µ(i), i = 1, ..., C (described by their means and a common variance) and the indices of the winners w(t) for each data vector x(t), t = 1, ..., N. For finding −ln p(x|H) we use the variational EM algorithm with θ = {µ, w} as missing observations. In the E-phase, an upper bound of the cost is minimized [2,5]. The rest of the parameters are included in H, i.e., we use ML estimates for the following: c, the hyperparameter governing the prior probability for a codebook vector to be a winner; σ²_x, the diagonal elements of the common covariance matrix; and µ_µ and σ²_µ, the hyperparameters governing the mean and variance of the prior distribution of the codebook vectors.
The upper bound for −ln p(x|H) = −ln ∫ p(x, θ|H) dθ is obtained from

    −ln ∫ p(x, θ|H) dθ = −ln ∫ q(θ) [p(x, θ|H)/q(θ)] dθ ≤ −∫ q(θ) ln [p(x, θ|H)/q(θ)] dθ ,

where the inequality follows from the convexity of −ln x by Jensen's inequality. H and q(θ) are alternately optimized, but instead of finding the intractable global optimum q(θ) = p(θ|x, H), we restrict ourselves to finding the optimum among the tractable family of distributions of the form q(µ, w) = q(µ)q(w). Substituting p(x, θ|H) = p(x|w, µ, σ²_x) p(w|c) p(µ|µ_µ, σ²_µ), denoting the integral over q(θ) by E{·} and arranging the terms then yields

    L = E{ −ln p(x|w, µ, σ²_x) }         [errors]
      + E{ ln [q(w)/p(w|c)] }            [winners]
      + E{ ln [q(µ)/p(µ|µ_µ, σ²_µ)] }    [codebook]

After substituting the Gaussian model for the codebook vectors, the cost function becomes²

² A similar derivation of the cost function for FVQ is given in [3,7]. For simplicity, we approximate the posterior probability of winners by the best vector only. Without significant additional computational cost all the winners could be taken into account. Individual variances for the model vectors could be included as well, and we plan to do this in the future.
    L = Σ_{j=1}^{d} Σ_{t=1}^{N} [ µ̃_j(w(t)) + (x_j(t) − µ̄_j(w(t)))² ] / (2σ²_{x_j})
      + (N/2) Σ_{j=1}^{d} ln 2πσ²_{x_j}
      − Σ_{t=1}^{N} ln c(w(t))
      + Σ_{j=1}^{d} Σ_{i=1}^{C} { [ µ̃_j(i) + (µ̄_j(i) − µ_{µ_j})² ] / (2σ²_{µ_j}) + (1/2) ln [ σ²_{µ_j} / µ̃_j(i) ] }
      − Cd/2                                                                                    (1)

where µ̄ and µ̃ are the mean and variance of the posterior distribution of the codebook vectors, and d is the dimensionality of the subspace, i.e., the size of the variable group.

Minimization of the Cost Function. Minimization of Eq. 1 is carried out so that C is fixed and the cost is minimized with respect to each of the variables c, w, σ²_x, µ_µ, σ²_µ, µ̄, µ̃. The following steps are repeated iteratively as long as the value of the cost function decreases or a maximum iteration count is reached (in our experiments, 100 iterations). This is repeated for various values of C.

1. Winner selection.
    ∂L/∂w(t) = 0  ⟹  w(t) = arg min_i { −ln c(i) + Σ_{j=1}^{d} [ µ̃_j(i) + (x_j(t) − µ̄_j(i))² ] / (2σ²_{x_j}) }

2. Update of the winner.
(a) Update of the posterior mean of the codebook vector.

    ∂L/∂µ̄_j(i) = 0  ⟹  µ̄_j(i) = [ σ²_{x_j} µ_{µ_j} + σ²_{µ_j} Σ_{w(t)=i} x_j(t) ] / [ σ²_{x_j} + σ²_{µ_j} f(i) ]

(b) Update of the posterior variance of the codebook vector.

    ∂L/∂µ̃_j(i) = 0  ⟹  µ̃_j(i) = σ²_{µ_j} σ²_{x_j} / [ σ²_{x_j} + σ²_{µ_j} f(i) ]

3. Update of the data variance.

    ∂L/∂σ²_{x_j} = 0  ⟹  σ²_{x_j} = (1/N) Σ_{t=1}^{N} [ µ̃_j(w(t)) + (x_j(t) − µ̄_j(w(t)))² ]

4. Updates of the codebook frequency prior and the parameters of the prior distribution.

    c(i) = (f(i) + 1) / (N + C),    µ_{µ_j} = (1/C) Σ_{i=1}^{C} µ̄_j(i),    σ²_{µ_j} = (1/C) Σ_{i=1}^{C} [ µ̃_j(i) + (µ̄_j(i) − µ_{µ_j})² ]

Here f(i) is the number of hits of µ(i), i.e., f(i) = #{t | w(t) = i}.
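The update rules above translate almost line by line into code. The following is a sketch, not the authors' implementation: the initialization, the optional fixed `init` indices, the fixed iteration count, and the small variance floors (1e-9) are assumptions added for numerical safety, and the cost-based stopping rule is omitted.

```python
import numpy as np

def fit_vq(X, C, n_iter=100, seed=0, init=None):
    """Variational EM for the VQ model of one variable group (updates 1-4).

    X: (N, d) data of the group; C: number of codebook vectors.
    init: optional indices of data points used as initial codebook means.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    idx = rng.choice(N, C, replace=False) if init is None else np.asarray(init)
    mu_bar = X[idx].astype(float)         # posterior means of codebook vectors
    mu_tilde = np.ones((C, d))            # posterior variances of codebook vectors
    sx2 = X.var(axis=0) + 1e-9            # data variances (diagonal covariance)
    c = np.full(C, 1.0 / C)               # winner prior c(i)
    mu_mu = X.mean(axis=0)                # prior mean of codebook vectors
    smu2 = np.ones(d)                     # prior variance of codebook vectors
    for _ in range(n_iter):
        # 1. Winner selection
        dist = ((mu_tilde[None] + (X[:, None, :] - mu_bar[None]) ** 2)
                / (2 * sx2)).sum(axis=2) - np.log(c)[None]
        w = dist.argmin(axis=1)
        f = np.bincount(w, minlength=C)                       # hit counts f(i)
        # 2. Posterior means and variances of the winners
        sums = np.zeros((C, d))
        np.add.at(sums, w, X)                                 # sum of x over w(t)=i
        mu_bar = (sx2 * mu_mu + smu2 * sums) / (sx2 + smu2 * f[:, None])
        mu_tilde = (smu2 * sx2) / (sx2 + smu2 * f[:, None])
        # 3. Data variance
        sx2 = (mu_tilde[w] + (X - mu_bar[w]) ** 2).mean(axis=0) + 1e-9
        # 4. Frequency prior and prior-distribution parameters
        c = (f + 1.0) / (N + C)
        mu_mu = mu_bar.mean(axis=0)
        smu2 = (mu_tilde + (mu_bar - mu_mu) ** 2).mean(axis=0) + 1e-9
    return mu_bar, mu_tilde, sx2, c, w
```

Note how the winner selection differs from plain k-means: the posterior variance µ̃ and the prior −ln c(i) both enter the assignment, which is exactly what makes the cost usable for model selection.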
4 Experiments
Four small experiments were run both to (1) verify the general IVGA principle (Experiments 1 vs. 2), i.e., that it is useful to model independent variable groups separately, and (2) to study the performance of the presented IVGA_VQ algorithm in optimizing the model for the data and in finding independent variable groups (Experiments 1 vs. 3, Experiments 2 vs. 4). In each experiment, the best one out of several trials is reported (for VQs this meant about 10000 trials; for IVGA_VQ, considerably fewer). With each approach we used a roughly equal amount of computation time. In reporting results we report the number of codebook vector parameters—the number of the rest of the parameters is substantially smaller and constant in all experiments.

Data and Variables. As a data set we used features from 1000 images that have been used for content-based image retrieval in [6]³. Each image is represented by a vector of 144 features (variables). The features fall naturally into three categories related to their origin: FFT, RGB, and texture. It is reasonable to assume that dependences between variables in different categories are weak⁴. Thus, the data provides an ideal validation for both the IVGA_VQ algorithm as well as the general IVGA approach. For the experiments we chose randomly a subset of 50 features: the variable set A consisted of 27 FFT features (numbered 1–27 below), B of 7 RGB features (28–34) and C of 16 texture features (35–50).

Experiment 1: VQ(A+B+C). A single VQ was used to model the combined set A+B+C of 50 variables. The optimized number of codebook vectors was 44 (the number of parameters was thus 44 × 50 = 2200) and the cost was 138115.5.

Experiment 2: VQ(A)+VQ(B)+VQ(C). The sets A, B and C were each modeled with a separate VQ. For set A, 16 codebook vectors were used (cost 116397.7), for B, 19 codebook vectors (cost 3921.6) and for C, 30 codebook vectors (cost 25476.9). The total number of parameters was 1045 and the total cost was 145796.2.
The model cost improved compared to Experiment 1, which shows that the general IVGA approach is feasible.

Experiment 3: IVGA-VQ(A+B+C). The feature sets A, B, and C were concatenated and IVGA-VQ was applied for modeling and grouping. The results are presented in Table 1. Twelve groups were found, and in terms of grouping the result was excellent: each group contained variables from one variable category only. Furthermore, the improved model cost shows that the presented IVGA-VQ algorithm performs well in the IVGA task.

³ We are grateful to the PicSOM project members for kindly providing us with the image data set.
⁴ This assumption was made in [6] prior to modeling each variable category with a different SOM.
Table 1. Results of the grouping when IVGA-VQ was run for the combined data set A+B+C. Negative costs are due to leaving out the constant factor in Eq. 1.

Group  Variables                   Belongs to  Cost       Codebook  Parameters
                                   data set               vectors
1      1,6                         A           -4760.6    7         14
2      2–5,7–8,11                  A           -33213.3   10        70
3      9–10,13,15–16,18–21,24–25   A           -50997.9   12        132
4      12,14,17                    A           -14201.3   8         24
5      22,26                       A           -8233.6    5         10
6      23,27                       A           -4920.4    10        20
7      28–34                       B           -3889.7    20        140
8      35–38                       C           -6678.2    7         28
9      39–42                       C           -7248.3    15        60
10     43–44                       C           -3140.6    9         18
11     45–46                       C           -2708.1    13        26
12     47–50                       C           -7215.0    19        76
Total                                          -147206.8  135       618
Experiment 4: IVGA-VQ(A) + IVGA-VQ(B) + IVGA-VQ(C). Next, we simulated a situation where the three main groups of variables (A, B, and C) are known from prior information. The IVGA-VQ algorithm was now used to find the best possible subgroupings for A, B, and C separately. The results are shown in Table 2. Note that the variable groupings obtained are almost, but not exactly, identical in Experiments 3 and 4. The slight improvement in total cost compared to Experiment 3 is as expected, since computational resources did not need to be allocated to separating the groups A, B, and C. The results of the four experiments are summarized in Table 3.
5 Conclusions
In this article, the basic idea of IVGA, i.e., grouping variables according to dependences within the data, was presented, motivated, and shown to work in practice. An efficient algorithm was suggested for computing IVGA and shown to work rather well on real data. In the experiments, vector quantization (VQ) was used to model the dependences within variable groups. In the calculation of the cost function, most of the unknown variables are marginalized out, and therefore the cost function can reliably be used for model selection. One should keep in mind that we used VQ only as a simple example: it could be replaced by any method, as long as the necessary cost function is derived for that method. Each variable group could even be modeled using a different method. The presented approach has implications both for constructing better models of large and complex phenomena and for the efficient computation of such models. However, further research is needed to study these issues in depth.
Table 2. Results when IVGA-VQ was run separately for each set A, B, and C.

Variable  Group  Variables                    Cost       Codebook  Parameters
set                                                      vectors
A         1      1,6                          -4773.4    9         18
A         2      2–5,7–8,11                   -33432.8   12        84
A         3      9–10,13,16,18,20,25          -32122.6   13        91
A         4      12,14–15,17,19,21–24,26–27   -46778.0   14        154
B         1      28–34                        -3868.6    17        119
C         1      35–38                        -6646.1    8         32
C         2      39–42,45–46                  -9934.4    26        156
C         3      43–44                        -3190.7    7         14
C         4      47–50                        -7187.7    11        44
Total                                         -147934.3  117       712
Table 3. Summary of results of all experiments.

Experiment                                   Total cost  #VQs  Parameters
1. VQ(A+B+C)                                 -138115.5   1     2200
2. VQ(A) + VQ(B) + VQ(C)                     -145796.2   3     1045
3. IVGA-VQ(A+B+C)                            -147206.8   12    618
4. IVGA-VQ(A) + IVGA-VQ(B) + IVGA-VQ(C)      -147934.3   9     712
References
1. J.-F. Cardoso. Multidimensional independent component analysis. In Proceedings of ICASSP'98, Seattle, 1998.
2. G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of COLT'93, pp. 5–13, Santa Cruz, California, USA, July 26–28, 1993.
3. G. E. Hinton and R. S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan et al., eds., Neural Information Processing Systems 6, San Mateo, CA, 1994. Morgan Kaufmann.
4. A. Hyvärinen and P. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
5. H. Lappalainen and J. W. Miskin. Ensemble learning. In M. Girolami, ed., Advances in Independent Component Analysis, pp. 76–92. Springer-Verlag, Berlin, 2000.
6. E. Oja, J. Laaksonen, M. Koskela, and S. Brandt. Self-organizing maps for content-based image database retrieval. In E. Oja and S. Kaski, eds., Kohonen Maps, pp. 349–362. Elsevier, 1999.
7. R. S. Zemel. A Minimum Description Length Framework for Unsupervised Learning. PhD thesis, University of Toronto, 1993.
Weight Quantization for Multilayer Perceptrons Using Soft Weight Sharing

Fatih Köksal¹, Ethem Alpaydın¹, and Günhan Dündar²

¹ Department of Computer Engineering
² Department of Electrical and Electronics Engineering
Boğaziçi University, Istanbul, Turkey
Abstract. We propose a novel approach for quantizing the weights of a multilayer perceptron (MLP) for efficient VLSI implementation. Our approach uses soft weight sharing, previously proposed for improved generalization, and considers the weights not as constant numbers but as random variables drawn from a Gaussian mixture distribution, which includes k-means clustering and uniform quantization as special cases. This approach couples the training of the weights for reduced error with their quantization. Simulations on synthetic and real regression and classification data sets compare various quantization schemes and demonstrate the advantage of the coupled training of the distribution parameters.
1 Introduction
Since VLSI circuits must be produced in large quantities for economy of scale, it is necessary to keep the storage capacity as low as possible to arrive at cheaper products. In an artificial neural network, the parameters are the connection weights, and if they can be stored using fewer bits, the storage need is reduced and we gain in memory. In this work, we try to find a good quantization scheme that achieves a reasonable compression ratio for the parameters of the network without significantly degrading accuracy. Our proposed method partitions the weights of a neural network into a number of clusters so that only one value is used for each cluster of weights. Thus, the actual memory which stores the real-numbered values will be small, and the weights will be pointers into this memory; all the weights in a cluster point to the same location. For example, given an MLP with 10,000 weights that can be grouped into 32 clusters, only five bits are used per weight. An analogy is the color map. To get 16 million colors, one requires 24 bits per pixel. Graphics adapters have color maps, e.g., of size 256, where each entry is 24 bits and is one of the 16 million colors. Using eight bits per pixel, we can then index one entry in the color map, so the storage requirement for an image is one third. Although this means that an image can contain only 256 of the 16 million possible colors, if quantization [1] is done well, there will be no degradation of quality. Our aim is to do a similar quantization of the weights of a large MLP.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 211–216, 2001.
© Springer-Verlag Berlin Heidelberg 2001
The benefit is obvious for digital storage in reducing the memory size. For analog storage of weights, capacitors are perhaps the most popular method, and in such circuits the area of the capacitors generally dominates the total area. Thus, it is very important to reduce the number of weights. There is a great amount of research on the quantization and efficient hardware implementation of neural networks (specifically the MLP). The aim is to get an MLP whose weights are represented by a smaller number of bits and which thus consumes less memory, at the expense of a loss of precision; example studies are given in [2,3,4,5,6]. Our approach is novel in that we do not reduce the precision; the operations are still carried out in full precision. We decrease storage by clustering the weights. The organization of the paper is as follows: In Section 2, we discuss the possible methods for quantization and divide them into two groups: after-training methods and during-training methods. Section 3 contains the experimental design and the results of these methods, and in Section 4, we conclude and discuss future work.
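The storage arithmetic behind this color-map analogy can be sketched as follows; this is our illustration (function name ours), assuming 32-bit floating-point weight values.

```python
import math

def quantized_storage_bits(n_weights, n_clusters, value_bits=32):
    """Total bits when each weight stores only an index into a shared codebook."""
    index_bits = math.ceil(math.log2(n_clusters))   # bits per weight index
    return n_weights * index_bits + n_clusters * value_bits

full_bits = 10_000 * 32                           # unquantized: one value per weight
packed_bits = quantized_storage_bits(10_000, 32)  # the 32-cluster example above
print(full_bits / packed_bits)                    # roughly a six-fold reduction
```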
2 Soft Weight Sharing
We can generally classify the applicable methods for weight quantization into two groups. The simplest method is to train the neural network normally and then directly apply quantization to the weights of the trained network; we call these after-training methods. However, applying quantization without considering its effect during training leads to large error, so combining quantization and training is a better alternative; we call these during-training methods [5]. The methodology we use is soft weight sharing [7], where it is assumed that the weights of an MLP are not constants but random variables drawn from a mixture of Gaussians

p(w) = \sum_{j=1}^{M} \alpha_j \phi_j(w)    (1)

where w is a weight, α_j are the mixing coefficients (prior probabilities), and the component densities φ_j(w) are Gaussians, i.e., φ_j(w) ∼ N(µ_j, σ_j²). The main reason for choosing this type of distribution for the weights is its generality and analytical simplicity. There are three types of parameters in the mixture distribution: prior probabilities α_j, means µ_j, and variances σ_j. Assuming that the weights are independent, the likelihood of the sample of weights is given by

L = \prod_{i=1}^{W} p(w_i)    (2)
In after-training methods, once the training of the MLP is complete, the Expectation-Maximization (EM) method can be used to determine the parameters [8]. It is known that when all priors and variances are equal, the well-known quantization method k-means clustering is equivalent to EM. One can even view uniform quantization in this framework, where, in addition to all priors and variances being equal, the means are equally spaced and fixed. In during-training methods, to couple the training of the weights with their quantization, we take the negative log-likelihood, converting it into an error function to be minimized:

\Omega = -\sum_i \ln \sum_{j=1}^{M} \alpha_j \phi_j(w_i)    (3)

This is then added as a penalty term to the usual error function E (mean square error in regression, cross-entropy in classification),

\tilde{E} = E + \upsilon \Omega    (4)

to get the augmented error function \tilde{E}, which is then minimized, e.g., using gradient descent, to learn both the weights w_i and the parameters of their distribution, i.e., µ_j, σ_j, and α_j. We do not give the update equations here due to lack of space; they can be found in [7].

Table 1. The test set error values (average ± standard deviation) on the regression dataset for several quantization levels. The unquantized MLP has an error of 4.10±0.06.

           1 Bit           2 Bits         3 Bits
Uniform    245.33±137.60   104.87±60.33   40.64±23.01
k-means    173.34±87.59    57.60±20.97    16.85±13.71
SWS        18.85±2.27      9.99±4.91      4.61±1.05
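Eqs. (1), (3), and (4) can be sketched in plain Python as follows; this is our illustrative reading with hypothetical names, not the authors' implementation (whose update equations are in [7]).

```python
import math

def gaussian(w, mu, sigma):
    """Component density phi_j(w) ~ N(mu, sigma^2)."""
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(w, alphas, mus, sigmas):
    """Eq. (1): p(w) = sum_j alpha_j * phi_j(w)."""
    return sum(a * gaussian(w, m, s) for a, m, s in zip(alphas, mus, sigmas))

def sws_penalty(weights, alphas, mus, sigmas):
    """Eq. (3): Omega, the negative log-likelihood of the weight sample."""
    return -sum(math.log(mixture_density(w, alphas, mus, sigmas)) for w in weights)

def augmented_error(E, weights, alphas, mus, sigmas, upsilon):
    """Eq. (4): E-tilde = E + upsilon * Omega."""
    return E + upsilon * sws_penalty(weights, alphas, mus, sigmas)
```

In an actual during-training run, gradient descent on this augmented error would update the weights and the mixture parameters jointly.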
3 Experimental Results
We have used one synthetic regression dataset for visualization and two real datasets, for speech phoneme and handwritten digit recognition. All datasets are divided into training and test sets. For all problems, we first train MLPs and find the optimum number of hidden units and the number of parameters; these are then quantized with different numbers of bits. We run each model ten times, starting gradient descent from different random initial weights, and report the average and standard deviation of the error on the test set. The MLP for the regression problem has one input, one output, and four hidden units, thus using 13 weights; without any quantization, we therefore need four bits. We try quantizing with two, four, and eight clusters (Gaussians), corresponding to one, two, and three bits. An example graph of the regression function as quantized by soft weight sharing is given in Figure 1, where the effect of quantization on the regression function is easily observed. The means and variances of the error with the three methods are given in Table 1. k-means is an after-training method and works better than uniform quantization. Better still is the result with soft weight sharing (SWS), which is a during-training method; this clearly demonstrates the advantage of coupled training.
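For concreteness, the two after-training baselines compared in Table 1 can be sketched for a one-dimensional weight vector roughly as follows; this is a minimal illustrative re-implementation (simple Lloyd iterations, our naming), not the authors' code.

```python
def uniform_quantize(weights, n_levels):
    """Equally spaced, fixed codebook over the weight range (the 'uniform' baseline)."""
    lo, hi = min(weights), max(weights)
    codebook = [lo + i * (hi - lo) / (n_levels - 1) for i in range(n_levels)]
    return [min(codebook, key=lambda c: abs(w - c)) for w in weights]

def kmeans_quantize(weights, n_levels, iters=50):
    """After-training k-means: Lloyd iterations on the scalar weights."""
    stride = max(1, len(weights) // n_levels)
    codebook = sorted(weights)[::stride][:n_levels]     # spread-out initialization
    for _ in range(iters):
        clusters = [[] for _ in codebook]
        for w in weights:                               # assign to nearest codeword
            j = min(range(len(codebook)), key=lambda j: abs(w - codebook[j]))
            clusters[j].append(w)
        codebook = [sum(c) / len(c) if c else codebook[j]   # recompute centers
                    for j, c in enumerate(clusters)]
    return [min(codebook, key=lambda c: abs(w - c)) for w in weights]
```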
Fig. 1. The graphs of the regression function: unquantized, and quantized by soft weight sharing (SWS) with one, two, and three bits. (Plots omitted; the four panels are titled "Unquantized MLP" and "MLP Quantized to 1/2/3 bits by SWS", each over the input range −10 to 10.)
Each speech phoneme sample is represented as a 112-dimensional input; the dataset contains 200 samples from each of six distinct classes (six speech phonemes) and is equally divided into training and test sets. The MLP has ten hidden units and 1196 parameters; thus, unquantized, we need 10 bits. We have used one, two, three, four, and five bits for our performance evaluation. Table 2 reports the simulation results. In Figure 2, we draw the scatter of the weights and the probability distribution fitted by soft weight sharing. Note that the probability distribution of the weights is clearly a Gaussian centered near zero. Although there are papers (e.g., [3]) which claim and assume that the weights of an MLP are distributed uniformly, we observe a zero-mean normal distribution. This should not be surprising: there are more than 1000 parameters, they are randomly initialized close to zero, and a large majority are not updated much later on. Note that if we were using a pruning strategy, the connections belonging to a cluster centered very close to zero would be the ones to be pruned. The significant weights are those distant from zero, since they have large error gradients and receive much larger updates than the other parameters.
Fig. 2. The distribution of the weights of the MLP used for speech phoneme recognition and the Gaussian mixture probability distribution fitted by soft weight sharing. Vertical bars are the weights; circles are the centers of the Gaussians. (Plots omitted; the five panels are titled "1 Bit (2 Gaussians)" through "5 Bits (32 Gaussians)", each over the weight range −5 to 5.)
The handwritten digit dataset contains 1,200 samples, each a 16 × 16 bitmap. The MLP has ten hidden units, and we have 2,680 parameters for quantization. The distribution of the weight values after the training phase of the MLP is again a normal distribution, as with the speech data set. Table 3 contains the quantization results. We see on both classification datasets that when the number of bits is large, even uniform quantization is good; the advantage of soft weight sharing, i.e., of coupled training, becomes apparent with a small number of clusters.
4 Conclusions
We propose to use soft weight sharing, previously proposed for improved generalization, for quantizing the weights of an MLP for efficient VLSI implementation, and compare it with the previously proposed methods of uniform quantization and k-means. Our results indicate that soft weight sharing, because it couples the training of the MLP with quantization, leads to more accurate networks at the same level of quantization. Once quantization is done, the results also indicate the saliency of the weights, which can further be used to prune the unnecessary connections; we leave this as future work.

Table 2. On the speech phoneme recognition problem, average ± standard deviation of the number of misclassifications out of 600 on the test set. For comparison, the unquantized MLP using 10 bits has a misclassification error of 40.70±3.79.

           1 Bit          2 Bits         3 Bits         4 Bits         5 Bits
Uniform    409.29±80.36   376.50±77.40   219.89±55.35   104.19±33.69   55.50±8.96
k-means    365.10±82.93   248.00±69.31   123.40±48.74   51.59±8.66     45.50±5.93
SWS        387.50±74.10   209.39±37.55   89.00±45.40    52.00±9.85     44.79±5.92

Table 3. On the handwritten digit recognition problem, average ± standard deviation of the number of misclassifications out of 600 on the test set. For comparison, the unquantized MLP has a misclassification error of 23.50±3.58.

           1 Bit           2 Bits         3 Bits        4 Bits       5 Bits
Uniform    526.70±21.75    512.70±42.37   441.70±70.21  95.19±37.08  29.29±5.67
k-means    448.39±114.14   263.29±75.21   58.79±27.43   27.70±5.38   24.50±2.80
SWS        397.20±139.41   244.19±53.66   64.19±37.24   30.70±7.28   27.90±6.47
Acknowledgment. This work is supported by Grant 00A0101D from Boğaziçi University Research Funds.
References
1. Gersho, A. and R. Gray, Vector Quantization and Signal Compression. Norwell, MA: Kluwer, 1992.
2. Choi, J. Y. and C. H. Choi, "Sensitivity Analysis of Multilayer Perceptron with Differentiable Activation Functions," IEEE Transactions on Neural Networks, Vol. 3, pp. 101–107, 1992.
3. Xie, Y. and M. A. Jabri, "Analysis of the Effects of Quantization in Multi-Layer Neural Networks Using a Statistical Model," IEEE Transactions on Neural Networks, Vol. 3, pp. 334–338, 1992.
4. Sakaue, S., T. Kohda, H. Yamamoto, S. Maruno, and Y. Shimeki, "Reduction of Required Precision Bits for Back-Propagation Applied to Pattern Recognition," IEEE Transactions on Neural Networks, Vol. 4, pp. 270–275, 1993.
5. Dündar, G. and K. Rose, "The Effects of Quantization on Multi Layer Neural Networks," IEEE Transactions on Neural Networks, Vol. 6, pp. 1446–1451, 1995.
6. Anguita, D., S. Ridella, and S. Rovetta, "Worst Case Analysis of Weight Inaccuracy Effects in Multilayer Perceptrons," IEEE Transactions on Neural Networks, Vol. 10, pp. 415–418, 1999.
7. Nowlan, S. J. and G. E. Hinton, "Simplifying Neural Networks by Soft Weight Sharing," Neural Computation, Vol. 4, pp. 473–493, 1992.
8. Alpaydın, E., "Soft Vector Quantization and the EM Algorithm," Neural Networks, Vol. 11, pp. 467–477, 1998.
Voting-Merging: An Ensemble Method for Clustering

Evgenia Dimitriadou, Andreas Weingessel, and Kurt Hornik

Institut für Statistik, Wahrscheinlichkeitstheorie und Versicherungsmathematik
Technische Universität Wien
Wiedner Hauptstraße 8–10/1071, A-1040 Wien, Austria
{dimi,weingessel,hornik}@ci.tuwien.ac.at
http://www.ci.tuwien.ac.at
Abstract. In this paper we propose an unsupervised voting-merging scheme that is capable of clustering data sets and of finding the number of clusters they contain. The voting part of the algorithm allows us to combine several runs of clustering algorithms into a common partition. This helps us to overcome instabilities of the clustering algorithms and improves the ability to find structure in a data set. Moreover, we develop a strategy for understanding, analyzing, and interpreting these results. In the second part of the scheme, a merging procedure starts from the clusters resulting from voting, in order to find the number of clusters in the data set¹.
1 Introduction
Clustering is the partitioning of a set of objects into groups so that objects within a group are "similar" and objects in different groups are "dissimilar". Thus, the purpose of clustering is to identify "natural" structures in a data set. In real-life clustering situations the following main problems are usually encountered: First, the true structure, especially the number and shapes of the clusters, is unknown. Second, different cluster algorithms, and even multiple replications of the same algorithm, yield different solutions due to random initializations and stochastic learning methods. Moreover, there is no clear indication which of the different solutions of the replications of the algorithm is the best one. Every cluster algorithm tries to optimize some criterion, like minimizing the mean-square error. If the task of clustering is to compress the data by mapping every data point to a prototype, then minimizing an appropriate error measure is the right thing to do; but if the task is to find structure in the data, such optimization criteria might help to find a good solution, yet they are not necessarily the right measures to optimize. Especially if the (generally non-Gaussian) probability distribution of the given data set is completely unknown, it is not clear which criterion to use.

To tackle these problems, we handle the results of various runs by using the existing idea of voting in classification [1,3]. We develop an algorithm which combines several results of a cluster algorithm (voting) into a common partition. The idea is that the voting procedure can be applied to any existing algorithm that has unstable results. The result of our voting is a "fuzzy" partition of the data: we propose that there are cases where an input point carries structure with a certain degree of confidence and may belong to more than one cluster with a certain degree of "belongingness". With this method, the outputs of several single classifiers can be combined so as to reduce the variance of the error between the different runs and to get an overall decision made by the combined classifiers. Additionally, voting can counteract the tendency of every algorithm to create clusters with a specific kind of structure (e.g., k-means creates round clusters) and may also produce non-convex ones. In the following steps, we take advantage of this "fuzzy" partition of the data to introduce some new measures for handling and understanding these results, and to develop a method for finding the right number of clusters in a data set. This technique is followed by a merging procedure, where clusters resulting from voting are merged according to the highest probability of their data points belonging to another cluster. This procedure continues until every cluster is merged, and then a decision according to some criteria is taken in order to specify the optimal number of clusters in the data set.

¹ This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 ('Adaptive Information Systems and Modeling in Economics and Management Science').

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 217–224, 2001.
© Springer-Verlag Berlin Heidelberg 2001
Consequently, our Voting-Merging Algorithm (VMA) represents a complete clustering scheme, able to partition the data points of a set into clusters and to find the optimal number of classes existing in the set. The paper is organized as follows. In Section 2 we present the Voting-Merging Algorithm (VMA) and its implementation. Section 3 demonstrates our experimental results with some comments on them. Finally, conclusions are given in Section 4.
2 Description of the VMA
Generally, the VMA is a scheme consisting of three procedures: the repeated runs of a clustering algorithm, the voting procedure receiving their results, and finally the merging procedure, which concludes by finding the number of clusters in the data set. The three levels of the algorithm are applied sequentially. They do not interfere with each other, but simply receive the results from the previous levels. No feedback takes place, and the algorithm terminates after the completion of all procedures.

2.1 Voting Procedure
In classification, there is a fixed set of labels which are assigned to the data. Therefore, for every input x we can compare the results of the various classifiers, i.e., the labels assigned to x, and apply a voting procedure between these results. Things are different in clustering, because different runs of a cluster algorithm can result in different clusters, which might partition the input data in totally different ways. Thus, there is the problem of deciding which cluster of one run corresponds to which cluster of another run. As an example, suppose that we have the results of several runs i of cluster algorithms which partition our data into 3 clusters C_ij, j = 1, 2, 3. When combining the first two runs we have to decide which of the clusters C_11, C_12, C_13 of the first run is similar to which cluster C_21, C_22, C_23 of the second run, if there is any similarity at all. Note that, as opposed to classification, the numbers 1, 2, 3 are arbitrarily assigned to the clusters and their order can be interchanged. If we combine more than two runs, the additional difficulty arises that the similarity relation is not transitive. That is, cluster C_11 of the first run might be similar to cluster C_22 of the second run, which might again be similar to cluster C_31 of the third run and so on, but this does not mean that C_31 is again similar to C_11. It might even turn out that C_31 is more similar to C_12, for example.

Fig. 1. The voting procedure. (Diagram omitted: a network with the input x at the bottom, a layer of clusters C_ij from the individual runs, a layer D_2j with connection weights 1/2 and 1/2, and a layer D_3j with weights 2/3 and 1/3.)
Since there seems to be no obvious way of combining more than two runs at once, we developed the following procedure, which is depicted by the network in Figure 1. Our input data x is clustered several times; the second layer (C_ij) of Figure 1 symbolizes three runs with three clusters each. Let C_ij denote the jth cluster in the ith run and D_ij denote the jth cluster in the combination of the first i runs. Our procedure works on an iterative basis. The first two runs are combined in the following way. First, a mapping between the clusters of the two runs is defined. To do this, we compute for each cluster C_2j what percentage of its points have been assigned to which cluster C_1k. Then, the two clusters with the highest percentage of common points are assigned to each other. Of the remaining clusters, again the two with the highest similarity are matched, and so on. After renumbering the clusters of the second run so that C_2j corresponds to C_1j for all j, we assign the points to the common clusters D_2j in the following way. If a data point x has been assigned to both C_1j and C_2j, it is assigned to D_2j. If x has been assigned to C_1j in the first run and to C_2k with j ≠ k in the second run, then it is assigned to both D_2j and D_2k with weight 0.5. This step is shown in the connections between the second layer (C_ij) and the third layer (D_2j) in Figure 1. Every cluster D_2j receives input from exactly one cluster of the first run and one of the second run; the weights of these connections are set to 1/2. If we have already combined the first n runs into a common clustering D_nj, we add an additional run C_(n+1)j by combining it with D_nj in the same way as for two runs, but give weight n/(n + 1) to the common clusters of the first n runs and weight 1/(n + 1) to the new clustering. Note that this voting procedure gives equal weight to all repetitions of the cluster algorithm; there is no additional gating network [9] which learns to decide between the different results. The reason is that, as opposed to supervised learning, we cannot decide which of the various results is the best.
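The matching-and-weighting scheme described above can be sketched roughly as follows; this is our simplified reading (hard input labels, greedy matching by raw overlap counts rather than percentages, hypothetical names), not the authors' implementation.

```python
def match_clusters(ref_labels, new_labels, k):
    """Greedily pair each cluster of the new run with the cluster of the current
    combined partition that shares the most points with it."""
    overlap = [[0] * k for _ in range(k)]          # overlap[new][combined]
    for old, new in zip(ref_labels, new_labels):
        overlap[new][old] += 1
    mapping, used_new, used_old = {}, set(), set()
    for _ in range(k):
        _, n_best, o_best = max((overlap[n][o], n, o)
                                for n in range(k) if n not in used_new
                                for o in range(k) if o not in used_old)
        mapping[n_best] = o_best
        used_new.add(n_best)
        used_old.add(o_best)
    return mapping

def vote(runs, k):
    """Combine hard clusterings (label lists in 0..k-1) into a fuzzy partition D,
    where D[x][j] is the fraction of runs assigning point x to cluster j."""
    n_points = len(runs[0])
    D = [[0.0] * k for _ in range(n_points)]
    for x, lab in enumerate(runs[0]):
        D[x][lab] = 1.0
    ref = list(runs[0])                            # hard version of D, for matching
    for i, run in enumerate(runs[1:], start=2):
        mapping = match_clusters(ref, run, k)
        for x in range(n_points):
            D[x] = [d * (i - 1) / i for d in D[x]]       # weight n/(n+1) to old runs
            D[x][mapping[run[x]]] += 1 / i               # weight 1/(n+1) to new run
        ref = [max(range(k), key=lambda j: D[x][j]) for x in range(n_points)]
    return D
```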
2.2 The Partition Resulting from the Voting Procedure
In the result of the voting procedure, the data points are typically not uniquely assigned to one cluster; there is a "fuzzy" partition. That is, after voting over N runs we get, for every data point x and every cluster j, a value D_Nj(x) which gives the fraction of times x has been assigned to cluster j. For interpreting the final result we can either accept this fuzzy decision or assign every data point x to the cluster k = argmax_j D_Nj(x) to which it has been assigned most often. We define the sureness of a data point as the percentage of times it has been assigned to its "winning" cluster k, that is, sureness(x) = max_j D_Nj(x). Then we can not only see how strongly a certain point belongs to a cluster, but we can also compute the average sureness of a cluster (avesure) as the average sureness of all the points that belong to it, i.e., avesure(k) = mean_{x∈k} D_Nk(x). In this way we notice which clusters have a clear data structure and which do not.
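The sureness and avesure measures just defined can be sketched as follows (our naming, assuming the fuzzy partition D is given as per-point membership lists):

```python
def sureness(D_x):
    """Fraction of runs in which point x was assigned to its winning cluster."""
    return max(D_x)

def avesure(D, k):
    """Average sureness of each cluster, taken over the points it wins."""
    won = [[] for _ in range(k)]
    for D_x in D:
        winner = max(range(k), key=lambda j: D_x[j])
        won[winner].append(D_x[winner])
    return [sum(s) / len(s) if s else 0.0 for s in won]
```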
2.3 Merging Procedure
As already mentioned in the previous section, after voting every data point x belongs to more than one cluster j with a certain degree of "belongingness" D_Nj(x). If we set n(k, j) := mean_{x∈k} D_Nj(x), we get a measure of how strongly the points of cluster k belong to cluster j. Thus, n(k, j) defines a (non-symmetric) neighborhood relation between two clusters. So, we can say that a cluster l is the closest cluster to cluster k if n(k, l) = max_{j≠k} n(k, j)². We can use this neighborhood relation to develop a merging procedure that starts with many clusters and merges clusters which are closest to each other. More specifically, two clusters k and l are merged if k is the closest cluster to l and l is the closest cluster to k. Additionally, we merge "chains" of clusters. That is, a set of clusters k_1, k_2, ..., k_n is merged if k_{i+1} is the closest cluster to k_i (i = 1, ..., n − 1) and k_1 is the closest cluster to k_n.

² Note that cluster k is not necessarily the closest cluster to cluster l.
After the merging step, the sureness of the points and the avesure of the clusters are recomputed by adding up the values of the merged clusters. The merging step is then repeated until some stopping criterion is met. The criterion that derives directly from this procedure is that merging stops when all clusters end up with an avesure of 100%. Since in practice this condition cannot always be realized, for example due to numerical precision problems, we decided to stop the merging procedure when the avesure of every cluster is greater than 99%. That is, on average every point is assigned less than once to a "wrong" cluster if voting has been applied with 100 runs; we thus have a significant decision that every point belongs to the "right" cluster. In this case the probability of a cluster being merged with another vanishes, which means that the procedure terminates normally. Unfortunately, this happens only when a simple and clear structure, especially a non-overlapping one, exists in the data set. In other situations clusters are able to reach a "very sure" structure (of more than 90%) but not one of 100% avesure. Note that clusters with one data point are not considered clusters; such a point is assigned to the next cluster with the highest probability of "winning" it. This is because one-point clusters can very easily become 100% "sure" and stop merging. Since it is unrealistic for real-world data sets to reach a 100% "sure" solution, it is still likely that the "real" clusters existing in a data set will keep increasing their avesure while they are being constructed by the merging procedure, and the rest will stay comparatively "unsure".
The problem that logically arises is that after the "real" clusters existing in the data set have been reached by merging, every further merging step will continue to increase the mean avesure of all clusters, and consequently it is not always clear when the merging should be stopped. However, this increase is not significant, because the clusters are by that point already quite "sure". Thus, to tackle these problems we suggest the following criterion (see [2]):

devsure(i) := (numsure(i) − numsure(i − 1)) − (numsure(i + 1) − numsure(i))

where numsure(i) is the sum of the sureness(x) of all points in all the n clusters in the ith step of the merging. This second-order difference shows how much the solution for the n clusters existing in step i of the merging deviates from the general increase (as n decreases) of the sureness. The solution with the maximum value of devsure(i) is the one whose cluster structure, found by voting and merging, is optimal for the algorithm.
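A minimal sketch of this stopping criterion (our naming; numsure is assumed to be given as a list indexed by merging step):

```python
def devsure(numsure):
    """Second-order difference of the total sureness over the merging steps."""
    return [(numsure[i] - numsure[i - 1]) - (numsure[i + 1] - numsure[i])
            for i in range(1, len(numsure) - 1)]

def best_step(numsure):
    """Pick the merging step whose sureness gain most exceeds the general trend."""
    d = devsure(numsure)
    return 1 + max(range(len(d)), key=lambda i: d[i])
```

For example, if the total sureness over successive merging steps is [50, 60, 90, 92, 93], the large jump at step 2 makes that step the chosen clustering.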
Implementation
We used different clustering algorithms in our experiments: k-means (also known as LBG, [4]) as an example of an offline algorithm, hard competitive learning as an online version of k-means (see, for example, [10]), and neural gas [6].
Evgenia Dimitriadou, Andreas Weingessel, and Kurt Hornik
The experiments presented here consist of 100 runs of one clustering algorithm with a large number of clusters specified at the start. There is no formula to determine this number, but experimental results show that it should grow in proportion to the size of the data set. This is logical: if a small data set is clustered with a large number of clusters, we unavoidably obtain many clusters with few points (for example 2 points) and also many empty ones. With too many clusters, we face the problem that at the beginning of the algorithm structures with only very few points (for example 2) are 100% "sure", which does not correspond to reality, while many empty clusters simply make the decision to cluster with a large number of clusters meaningless. Moreover, for large data sets it is advisable to cluster with many centers, to avoid points that belong to different clusters but are close neighbors being clustered together. For the same reason, the number of clusters with which a clustering algorithm runs should also be larger the more overlapping the structures of the data set appear, as for example in real-world data sets in general. After this step a voting between these runs follows, and then a merging according to the voting results. Merging stops when we reach the 2 clusters in a data set. Since we know the true cluster structure of our data sets, we can treat the result of a cluster algorithm as a classification problem [8], although we do not use the class information during clustering. That is, we can compute how many points have been assigned by the cluster algorithm to the right cluster. All our experiments have been performed in R, a system for statistical computation and graphics which implements the well-known S language for statistics. R runs under a variety of Unix platforms (such as Linux) and under Windows 95/98/NT.
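The experimental loop described above — many independent runs of a base clusterer started with a deliberately large number of clusters, whose label rows then feed the voting and merging steps — can be sketched as follows. The authors' code was written in R; this is a Python illustration with a minimal k-means standing in for the base algorithm, and all names are ours:

```python
import random

def kmeans(X, k, iters=20, rng=None):
    """Minimal k-means (LBG-style) used as the base clusterer; X is a list
    of numeric tuples. Returns one label per point."""
    rng = rng or random.Random()
    centers = [list(X[i]) for i in rng.sample(range(len(X)), k)]
    labels = [0] * len(X)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for p, x in enumerate(X):
            labels[p] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])),
            )
        # update step: move each non-empty center to its members' mean
        for c in range(k):
            members = [X[p] for p in range(len(X)) if labels[p] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def repeated_runs(X, k=30, runs=100, seed=0):
    """One label row per independent clustering run; the voting step then
    compares these rows pairwise."""
    rng = random.Random(seed)
    return [kmeans(X, k, rng=rng) for _ in range(runs)]
```

Each run uses a fresh random initialization from the shared generator, which is what makes the subsequent voting informative.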
It is available freely via CRAN, the Comprehensive R Archive Network, whose master site is at http://www.R-project.org.
3
Experimental Results
An artificial data set as well as a real-world set are used to demonstrate the performance of the VMA. A description of these data sets, some figures, and some comments on the results follow in order to demonstrate the performance of the algorithm clearly. Each voting run has been performed with the results of 100 new cluster runs. 3.1
A 2Dimensional Example
This data set is a 2-dimensional one, a simplification of the one described and used in [5]. It consists of 2000 data points belonging to 2 elongated, curved clusters with opposite curvature (see Figure 2(a)). The two curves with 1000 data points each are generated in the following way: one according to y = 2(x − 1)² + ε, where the x are uniformly distributed in [0, 2] and ε is a normally distributed noise term with zero mean and standard deviation 0.1; the other according to y = 3 − 2(x − 2)² + ε, with x uniformly distributed in [1, 3] and ε as above. This data set is not linearly separable.
Voting-Merging: An Ensemble Method for Clustering
After voting between 100 results of hard competitive learning runs (starting with 30 clusters), merging again partitions the data points perfectly into their clusters (100% classification rate). Here, since we reach 100% "avesure" clusters, merging stops without the help of the criterion.

Fig. 2. Elongated Curves: (a) 2-Dimensional Data Set; (b) Clustering Result
On this data set as well, single hard competitive runs with the number of clusters fixed to 2 can never find the two curved clusters, misclassifying 10% of the data points on average (a typical result of hard competitive learning is shown in Figure 2(b)).
3.2
Real-World Data Set: DNA
We apply a clustering algorithm to the DNA data set (see [7]). The DNA set is normally considered a classification problem because the true classes are known, but since there are no well-established benchmark data sets for clustering problems, we treat it as a clustering problem. It consists of 2000 data points (splice junctions), i.e., points on a DNA sequence at which "superfluous" DNA is removed during protein creation. The data points are described by 60 binary indicator variables, and the problem is to recognize the 3 classes (ei, ie, neither), i.e., the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). Here too, the VMA succeeds in detecting the right number of clusters in the data set, with a classification rate of 85.56%. The voting procedure has been performed on 100 runs of the hard competitive algorithm (starting with 90 clusters).
4
Conclusion
In this paper we present an unsupervised scheme that is able to cluster the data points in a given data set and also to find the number of clusters. The voting procedure of the algorithm combines the results of several independent cluster runs by voting between their results. This allows us to deal with the problem
of local minima of cluster algorithms, and to find a partition of the data that is supported by repeated applications of the cluster algorithm and is not influenced by the randomness of initialization or of the cluster process itself. Moreover, we develop a strategy for analyzing and interpreting the results of the voting algorithm such that both existing clusters and data points belonging to a cluster can be identified as "sure" or "clear" with respect to their structure and their partition, respectively. The next part of the algorithm consists of merging the clusters resulting from voting, while a criterion is computed in order to decide in favor of a specific number of clusters. In conclusion, the VMA does not suffer from threshold definitions, since only one parameter needs to be defined; it reduces the instabilities of other clustering methods to the minimum possible, and it can support non-convex solutions.
References

1. Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
2. Evgenia Dimitriadou, Andreas Weingessel, and Kurt Hornik. Voting in clustering and finding the number of clusters. In H. Bothe, E. Oja, E. Massad, and C. Haefke, editors, Proceedings of the "International Symposium on Advances in Intelligent Data Analysis (AIDA 99)" ("International ICSC Congress on Computational Intelligence: Methods and Applications (CIMA 99)"), pages 291–296. ICSC Academic Press, 1999.
3. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Lecture Notes in Computer Science, 904, 1995.
4. Yoseph Linde, Andrés Buzo, and Robert M. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28(1):84–95, January 1980.
5. Jianchang Mao and Anil K. Jain. Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks, 6(2):296–317, March 1995.
6. Thomas M. Martinetz, Stanislav G. Berkovich, and Klaus J. Schulten. "Neural-Gas" network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558–569, July 1993.
7. D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
8. Hans-Joachim Mucha. Clusteranalyse mit Mikrocomputern. Akademie Verlag, 1992.
9. A. S. Weigend, M. Mangeas, and A. N. Srivastava. Nonlinear gated experts for time series: Discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6:373–399, 1995.
10. Lei Xu, Adam Krzyzak, and Erkki Oja. Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Transactions on Neural Networks, 4(4):636–649, July 1993.
The Application of Fuzzy ARTMAP in the Detection of Computer Network Attacks

James Cannady (1) and Raymond C. Garcia (2)

(1) School of Computer and Information Sciences, Nova Southeastern University, Fort Lauderdale, FL 33314, [email protected]
(2) School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, [email protected]

Abstract. The timely and accurate detection of computer and network system intrusions has always been an elusive goal for system administrators and information security researchers. Existing intrusion detection approaches require either manual coding of new attacks in expert systems or the complete retraining of a neural network to improve analysis or to learn new attacks. This paper presents a new approach to applying adaptive neural networks to intrusion detection that is capable of autonomously learning new attacks rapidly using feedback from the protected system.
Introduction

The timely and accurate detection of computer network attacks is one of the most difficult problems facing system administrators. The individual creativity of attackers, the wide range of computer hardware and operating systems, and the ever-changing nature of the overall threat to targeted systems have contributed to the difficulty in effectively identifying intrusions. While intrusion detection systems (IDS) generally rely on rule-based expert systems to identify attacks, a limited amount of research has been conducted on the application of neural networks to address the inherent weaknesses in rule-based approaches. Debar (1992) and Fox (1990) proposed neural networks as alternatives to the statistical analysis component of anomaly detection systems. These neural networks identify the typical characteristics of system users and identify statistically significant variations from the user's established behavior. Cannady (1998) demonstrated the use of multi-level perceptron/SOM hybrid neural networks in the identification of computer attacks, and Bonifacio (1998) demonstrated the use of a neural network in an integrated detection system. However, each of these approaches required the complete retraining of the neural networks to learn new attacks. In Cannady (2000) a cerebellar model articulation controller (CMAC) neural network (Albus, 1975) was applied in the detection of denial-of-service (DoS) attacks. While the CMAC approach was very effective in detecting the DoS attacks, this category of network intrusion is relatively simplistic. There are numerous other types of network attacks that are much more
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 225–230, 2001. © Springer-Verlag Berlin Heidelberg 2001
difficult to detect. The detection of these complex forms of network attack may require some form of clustering to accurately identify the large variety of potential attacks. This paper presents the results of a research effort that investigated the application of a Fuzzy ARTMAP (FAM) neural network in the detection of network-based attacks. The objective of this research was to evaluate the ability of a FAM neural network to autonomously learn new attack patterns in a simulated network data stream. The inability of existing systems to autonomously identify new complex attack patterns increases the long-term cost of the systems due to the requirement for dedicated personnel to identify and implement the necessary updates. However, more significant is the fact that the lack of autonomous learning of complex network attacks by existing approaches results in an IDS that is only as current as its most recent update and therefore becomes progressively less effective over time.
Fuzzy ARTMAP Approach

The FAM neural network architecture was selected for this application primarily because of the enhanced clustering capabilities of the algorithm. FAMs are also easy to use: they have a small number of parameters and require no problem-specific system crafting or choice of initial weight values. Each pattern is learned as it is received online, as opposed to an offline optimization of a criterion function. FAMs provide mathematical precision in terms of ART concepts that outperform other soft-computing techniques. Their strengths include autonomous learning, recognition, and prediction with fast learning of rare events, stable learning of large non-stationary databases, efficient learning of morphologically variable events, and associative learning of many-to-one and one-to-many maps. A detailed overview of the FAM learning process is presented in (Carpenter, 1992). By definition, resonance occurs if the match function meets the vigilance criterion. The match function is defined as |I ∧ wJ| / |I|, where I is the input activity vector and wJ is a vector of adaptive weights. The vigilance criterion itself is |I ∧ wJ| / |I| ≥ ρ, where ρ ∈ [0, 1] is the vigilance parameter. In both definitions, ∧ represents the fuzzy AND (minimum) operation. Once the vigilance criterion is met, the weight vector is updated in the following way:

wJ(new) = β (I ∧ wJ(old)) + (1 − β) wJ(old)    (1)
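The match function and the update of Eq. (1) can be sketched directly, following the Carpenter (1992) formulation summarized above, with |·| the sum norm and ∧ the componentwise minimum. Function names are our own:

```python
def fuzzy_and(I, w):
    """Fuzzy AND: componentwise minimum of two activity vectors."""
    return [min(a, b) for a, b in zip(I, w)]

def resonates(I, wJ, rho):
    """Vigilance criterion: |I ^ wJ| / |I| >= rho."""
    return sum(fuzzy_and(I, wJ)) / sum(I) >= rho

def update(I, wJ, beta=1.0):
    """Eq. (1): wJ_new = beta * (I ^ wJ_old) + (1 - beta) * wJ_old.
    beta = 1.0 gives the fast-learning mode."""
    return [beta * m + (1.0 - beta) * w for m, w in zip(fuzzy_and(I, wJ), wJ)]
```

With beta = 1.0 the weight vector simply shrinks to the fuzzy AND of input and old weights, which is the fast-learning regime usually used with FAM.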
The initial series of experiments was conducted to evaluate the ability of the FAM algorithm to distinguish typical event patterns. A Java-based prototype application was created based on the description of the FAM algorithm provided in (Carpenter, 1992). A network-monitoring tool was used to collect a data sample of 100,000 packets from an Ethernet network on the campus of Nova Southeastern University. Thirty-seven attack sequences were included in the data sample. These attacks included several forms of denial-of-service attacks, kernel panic attacks, and attacks related to the unauthorized access of system-level resources by an intruder. A subgroup of 10,000 packets was collected from the initial sample, including examples of the 37 different forms of network attacks. The data were arranged in the order that
they were received into data vectors consisting of 10 elements each. The input vectors were ordered sets of normalized floating-point numbers that represented types of Ethernet network packets (e.g., ping, telnet, FTP, etc.). These packet sequences were then categorized within a range from 0.0 (normal data) to 0.99 (attack sequence) based on the effect of the sequence on the state (s) of the protected system. Binary representations of the packet sequence categories were used as inputs to the ARTb module, and the packet sequence vectors were used as inputs to the ARTa module. The evaluation of the FAM algorithm was based on the average mean difference between the output of the ARTa module and the desired output for each of the packet sequence vectors. The FAM algorithm implemented in the prototype application achieved an average mean difference of 0.0066 in the test of the data sample. The evaluation demonstrated the ability of the FAM algorithm to accurately identify typical event sequences in a network data stream. The second series of tests was conducted to evaluate the ability of the FAM architecture to autonomously learn and identify attack sequences in a network data stream. While the FAM architecture demonstrated the ability to accurately identify sample network events, a modern computer network data stream is an extremely complex and dynamic environment. The need for an autonomous learning capability in this application is based on the wide variety of potential attacks and the need to recognize new threats quickly. As a result, the FAM architecture used in these experiments was modified to incorporate a limited reinforcement learning capability. The ARTb input was replaced with a variable (f) that represented the feedback from the computer system being protected by the application.
Each feedback value was based on the combined effect of the input packets on a variety of system state indicators that could be monitored to provide an indication of an attack. These characteristics included the current CPU load, the available network bandwidth, the available memory, and the number of active connections. The feedback variable also included an indication of the relative compliance of the current system state with an established security policy. For example, the policy of the protected system in these experiments included a prohibition against modification of the system log (syslog) files. An attacker who is able to write to the syslog can overload the buffer with erroneous data and render the syslog daemon useless. As a result, any attempt to modify the syslog by a non-root process is immediately flagged as a violation of the security policy and incorporated into the feedback variable. The system state (s) of the protected host was in the range 0.0 (system stopped) to 0.99 (system optimal), and f was computed as the inverse of the state of the protected host (1 − s). While receiving normal network data the state of the protected host should be nominal (e.g., 0.75–0.99), and the corresponding ARTa output should be small (e.g., 0.0–0.25). However, during an attack the state of the protected host becomes degraded as the system reacts to the intrusion, and s is reduced. The use of f provided the FAM with a non-supervised method of learning new patterns. This is a particularly critical capability for a system that must quickly identify new attack patterns before significant damage is done to the protected system. The revised FAM approach was initially tested by initializing the FAM weights and using a simulated network data stream that contained 100,000 packets. The data sample included single instances of the 37 attack sequences that were included in the initial evaluation. The FAM utilized a map field (F^AB) vigilance parameter of 0.1.
In this case a low vigilance parameter was chosen to limit the number of ARTa nodes that
are created. With a high vigilance parameter, the number of ARTa nodes in an actual network implementation could quickly exceed the available memory resources of the system running the FAM. The FAM was evaluated based on the average mean difference between the ARTa output for each of the 37 attack sequences and the f value for each of the input vectors. A radial basis function (RBF) neural network was used to simulate the state (s) of the protected system. The RBF neural network was trained to provide an output for each of the input vectors consistent with the expected state of a Linux-based host receiving the same network data stream. After the FAM processed the 100,000 data samples, the application achieved an average mean difference of 0.335. This error was significantly higher than the average error rate of 0.15 in commercial intrusion detection systems (Bonifacio, 1998). While the initial evaluation demonstrated some capability for online learning of a priori attacks, the average mean error would be unacceptable in an actual network implementation. In an effort to enhance the detection capability of the FAM approach, an adaptive F^AB vigilance parameter was used in place of the static 0.1 vigilance parameter used in the initial evaluation. In the modified algorithm, f was used as the F^AB vigilance parameter:

|I ∧ wJ| / |I| > f    (2)

The variability of the F^AB vigilance parameter was considered a viable method of allowing a system to increase or decrease the rate at which new ARTa clusters are created in response to circumstances in a dynamic environment. By using f as the F^AB vigilance parameter, the FAM implemented in this research was designed to rapidly learn new attack patterns when the state of the protected host was degraded, and to create nodes representing new patterns at a slower rate when the state was nominal. This results in an adaptive learning process that responds to the current level of potential threat to the protected host. However, to avoid the potential adverse impact on the performance of the application that may be caused by a constantly changing vigilance parameter, a limited level of adaptability was utilized. In the revised application the static F^AB vigilance parameter is replaced with f only when the state of the protected system drops below 0.25. In this way the FAM retains the ability to limit the creation of new ARTa nodes during normal activity, while it also possesses the flexibility to recognize new attack patterns quickly when the state of the protected system is degraded. The revised FAM algorithm was retested using the same data set after the weights were initialized. In this test the FAM achieved a mean average difference of 0.043. This was a significant improvement over the use of a static F^AB vigilance parameter (Fig. 1), and a level of detection accuracy higher than that of currently available commercial applications.
Conclusions and Future Work

Research and development of IDSs has been ongoing since the early 1980s, and the challenges faced by designers increase as the targeted systems become more diverse and complex. The results demonstrate the potential for a powerful new analysis component of a complete IDS that would be capable of identifying both known (a priori) and previously unseen network attack patterns. Based on the results of the tests conducted on this approach, there were several significant advances in the detection of network attacks:
Fig. 1. Attack Detection Test Results (detection responses for the 37 attack sequences; static vs. adaptive vigilance parameter)
- Online learning of complex attack patterns – The approach has demonstrated the ability to rapidly learn new complex attack patterns without the complete retraining required in other neural network approaches. This is a significant advantage that could allow the IDS to continually improve its analytical ability without the requirement for external updates.
- Extremely accurate identification of known attack patterns – The use of the dynamic F^AB vigilance parameter resulted in an average error of 0.04, compared with an average error of 0.15 in existing IDSs. Because other information security components rely on the accurate detection of computer attacks, the ability to accurately identify network events could greatly enhance the overall security of computer systems.
- Adaptive learning algorithm – The use of an adaptive F^AB vigilance parameter, based on the current state of the protected host, provides the ability to rapidly learn new attacks, thereby significantly reducing learning time in periods when rapid attack identification is required.
The results of the tests of this approach show significant promise, and our future work will involve the development of a full-scale integrated intrusion detection and response system that will incorporate the Fuzzy ARTMAP-based approach as the analytical component.
References

Albus, J.S. (1975, September). A New Approach to Control: The Cerebellar Model Articulation Controller (CMAC). Transactions of the ASME.
Bonifacio, J.M., Cansian, A.M., de Carvalho, A., & Moreira, E. (1998). Neural Networks Applied in Intrusion Detection. In Proceedings of the International Joint Conference on Neural Networks.
Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., & Rosen, D.B. (1992). Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps. IEEE Transactions on Neural Networks, 3, 698–713.
Cannady, J. (1998). Applying Neural Networks to Misuse Detection. In Proceedings of the 21st National Information Systems Security Conference.
Cannady, J. (2000, July). Applying CMAC-based Online Learning to Intrusion Detection. In Proceedings of the 2000 IEEE/INNS Joint International Conference on Neural Networks.
Debar, H., & Dorizzi, B. (1992). An Application of a Recurrent Network to an Intrusion Detection System. In Proceedings of the International Joint Conference on Neural Networks.
Fox, K.L., Henning, R.R., & Reed, J.H. (1990). A Neural Network Approach Towards Intrusion Detection. In Proceedings of the 13th National Computer Security Conference.
Transductive Learning: Learning Iris Data with Two Labeled Data

Chun Hung Li and Pong Chi Yuen
Department of Computer Science, Hong Kong Baptist University, Hong Kong
{chli,pcyuen}@comp.hkbu.edu.hk
Abstract. This paper presents two graphbased algorithms for solving the transductive learning problem. Stochastic contraction algorithms with similarity based sampling and normalized similarity based sampling are introduced. The transductive learning on a classical problem of plant iris classiﬁcation achieves an accuracy of 96% with only 2 labeled data while previous research has often used 100 training samples. The quality of the algorithm is also empirically evaluated on a synthetic clustering problem and on the iris plant data.
1
Introduction
In recent years, important classification tasks have emerged with enormous volumes of data. The labeling of a significant portion of the data for training is either infeasible or impossible. A number of approaches have been proposed to combine a set of labeled data with unlabeled data for improving the classification rate. The co-training approach has been proposed to solve the problem of web page classification, where the web pages can be represented by two independent representations [1]. Subsequently, a similar co-training method was invented for combining labeled and unlabeled data by co-training with two learning algorithms [2]. Instead of using two representations of the data, this co-training algorithm uses two learning algorithms. The naive Bayes classifier and the EM algorithm have been combined for classifying text using labeled and unlabeled data [3]. A modified support vector machine and non-convex quadratic optimization approaches have been studied for optimizing semi-supervised learning [4]. Transductive learning has been suggested to simplify the learning problem when a restricted amount of data is available [5]. One of the essential properties of transductive learning is the estimation of the function of interest at only some given points rather than the estimation of the complete decision surface. Support vector machines have been extended with transductive inference to classify text [6]. Transductive learning has also been applied to improve color tracking in video sequences [7]. Graph-based clustering has received a lot of attention recently: a factorization approach has been proposed for clustering [8], and the normalized cuts have been proposed as a generalized method for clustering [9] and [10]. In this paper, we investigate the use of a stochastic graph-based sampling approach for solving
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 231–236, 2001. © Springer-Verlag Berlin Heidelberg 2001
the transductive learning problem. Graph-based clustering is also shown to be closely related to similarity-based clustering [11].
2
Graph Based Clustering
The data is modeled using an undirected weighted graph G(V, E) consisting of a set of vertices and a set of edges. The data points to be clustered are represented by vertices in the graph, and the weight of an edge between two vertices represents the similarity between the objects indexed by the vertices. The similarity matrix is S ∈ R^(N×N), where sij represents the similarity between the vectors xi and xj. A popular choice of the similarity is of the form

sij = exp(−‖xi − xj‖ / σ²)    (1)

where ‖x‖ is the Euclidean norm and σ is a scale constant. The exponential similarity function has been used for clustering in [8], [9] and [10]. We can also employ the k-nearest-neighbour graph, where the similarity matrix includes the edge (i, j) only if j is one of the k nearest neighbours of i. This reduces the number of edges in the graph from N² to kN. The minimum cut algorithm partitions the vertices of the graph G into two disjoint sets (A, B) that minimize the following objective function:

f(A, B) = Σ_{i∈A} Σ_{j∈B} sij    (2)
Karger has proposed a randomized contraction algorithm for ﬁnding the minimum cuts [12].
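The similarity graph of Eq. (1) and the cut objective of Eq. (2) can be sketched as follows; this is a hedged illustration with our own parameter names, not the authors' code:

```python
import math

def similarity(xi, xj, sigma=1.0):
    """Eq. (1): s_ij = exp(-||x_i - x_j|| / sigma^2), with the Euclidean norm."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
    return math.exp(-d / sigma ** 2)

def similarity_matrix(X, sigma=1.0):
    """Full N x N similarity matrix over a list of numeric tuples."""
    return [[similarity(xi, xj, sigma) for xj in X] for xi in X]

def cut_weight(S, A, B):
    """Eq. (2): f(A, B) = sum over i in A, j in B of S[i][j],
    the weight of the cut separating vertex sets A and B."""
    return sum(S[i][j] for i in A for j in B)
```

A good two-way clustering is then a pair (A, B) with small cut weight, which is what the randomized contraction of the next section searches for.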
3
Graph Contraction Algorithm
The transductive learning algorithm assumes a set of given labeled samples. Initially, assign an empty label to all nodes, i.e., Li = φ for all nodes i. Then assign to the given labeled nodes their respective labels. After this initialization, the following contraction algorithm is applied:

1. Randomly select an edge (i, j) from G with probability proportional to sij; depending on the labels of i and j, do one of the following:
(a) If Li = φ and Lj = φ, then Lij = φ
(b) If Li = φ and Lj ≠ φ, then assign Lij = Lj
(c) If Lj = φ and Li ≠ φ, then assign Lij = Li
(d) If Li ≠ Lj and Li ≠ φ and Lj ≠ φ, then remove the edge (i, j) from G and return to step 1
2. Contract the edge (i, j) into a meta-node ij
3. Connect all edges incident on i and j to the meta-node ij while removing the edge (i, j)
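The steps above can be sketched with a union-find structure over the vertices; this is our reading of the pseudocode, not the authors' implementation. It samples among the original edges with probability proportional to sij and blocks contractions that would merge two differently labeled meta-nodes:

```python
import random

def contract(S, labels, n_clusters, rng=None):
    """S: symmetric similarity matrix (nested lists); labels: dict mapping
    the few labeled vertices to their labels. Contracts until n_clusters
    meta-nodes remain or all edges are removed; returns a root id per vertex."""
    rng = rng or random.Random()
    n = len(S)
    parent = list(range(n))

    def find(i):                         # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if S[i][j] > 0]
    weights = [S[i][j] for i, j in edges]
    lab = dict(labels)                   # meta-node root -> label
    alive = list(range(len(edges)))
    comps = n
    while comps > n_clusters and alive:
        e = rng.choices(alive, weights=[weights[k] for k in alive])[0]
        i, j = find(edges[e][0]), find(edges[e][1])
        if i == j:                       # edge now internal to a meta-node
            alive.remove(e)
            continue
        li, lj = lab.get(i), lab.get(j)
        if li is not None and lj is not None and li != lj:
            alive.remove(e)              # step 1(d): inconsistent labels
            continue
        parent[j] = i                    # steps 2-3: contract j into i
        if li is None and lj is not None:
            lab[i] = lj                  # steps 1(b)/(c): label survives
        comps -= 1
    return [find(v) for v in range(n)]
```

Because rule 1(d) removes edges between differently labeled meta-nodes, the labeled seeds can never be merged together, which is the consistency guarantee discussed below.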
Fig. 1. Two clusters
Fig. 2. Solution I
This contraction is repeated until all the nodes are contracted into a specified number of clusters or until all edges are removed. This semi-supervised contraction guarantees consistency of the labeling outcome when merging the meta-nodes. Furthermore, both of the algorithms above can be repeated as separate randomized trials on the data. The results of each trial can be considered as giving the probability of connectivity between individual nodes, and can then be combined to give a more accurate estimation of the probability of connectivity. Another version of the transductive learning can be conducted through a weighted cut sampling method, where the random selection of the edge to be contracted is weighted by the number of elements in the nodes containing the edge. The normalized similarity between two meta-nodes is given by

s′ij = sij / (ni nj)    (3)

where ni and nj are the numbers of vertices in the meta-nodes i and j. The normalized graph sampling algorithm selects an edge (i, j) for contraction with probability proportional to s′ij. This selection process ensures that the nodes with a small number of elements are selected first.
4
Results and Discussions
4.1
Testing on Synthetic Data
A synthetic dataset is used for testing the transductive learning algorithm. Figure 1 shows the data for a two-cluster clustering problem. The two clusters are separated by a sinusoidal boundary. To test the performance of the transductive learning algorithm on the synthetic data, the numbers of training samples with true labels used for training are: Case I, one sample; Case II, two samples; Case III, ten samples; from each class. With such a small number of labeled training samples from each class, one can judge from the figures that an inference-based/decision-surface-based classifier will not be able to accurately determine the sinusoidal boundary between the clusters. Four hundred randomized trials are performed
Table 1. Performance of transductive learning on synthetic data: (1) errors of combined estimation; (2) standard deviation of estimation

              Sampling          Normalized Sampling
              Error     SD      Error     SD
    Case I    6.6%      0.16    1.8%      0.09
    Case II   0.6%      0.09    1.2%      0.05
    Case III  0.4%      0.02    0.4%      0.02
for each experiment. The average error percentage of the transductive learning is shown in Table 1. From Table 1, we can see that the combined estimation is very accurate, having errors of less than one percent in Cases II and III.
4.2
Iris Dataset
The iris dataset is also used for testing transductive learning. It consists of 150 samples of measurements of the iris plant. There are three species of iris in the dataset, and each species has 50 samples. Typical approaches use 100 samples for training and 50 held-out samples for testing. The iris dataset is well studied and results can be found throughout the literature [13]. In the first experiment on transductive learning for the iris dataset, we consider using only 2 labeled samples. We pick the first sample in class 2 and the first sample in class 3 as the labeled samples and denote this experiment as experiment (a). We perform transductive learning until 3 metanodes are left after successive contractions. All objects that are contracted into a metanode containing labeled data are assigned to the class of that labeled data. The metanode that does not contain any of the labeled data is considered class one. The above trial is repeated 200 times, and we approximate the probability of an object belonging to each of the three classes by the number of times the object is assigned to that class divided by 200. Figure 3 shows the estimated probability of the different classes after transductive learning with 2 labeled samples. The x-axis in the graphs refers to the iris plants in the order of the iris dataset file. The plot of P1 shows that the first 50 elements are correctly classified as class one in each of the 200 trials. The plot of P2 shows that iris plants 51 to 100 have high probabilities, around 0.8, of being class 2. The probabilities of plants 101 to 150 being class two show a large amount of confusion: a number of plants incorrectly show a high probability of being class two. The plot of P3 shows a similar situation. In the second experiment, experiment (b), we use iris plant 52 as the sample for class 2 and plant 102 as the sample for class three.
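Combining the 200 randomized trials into per-object class probabilities is plain counting; a minimal sketch (the list layout is our assumption, not the authors' code):

```python
from collections import Counter

def class_probabilities(trial_labels, n_classes):
    """Estimate class-membership probabilities from repeated trials.

    trial_labels: list of per-trial label assignments, each a list with one
    integer class label per object.  Returns, for each object, the fraction
    of trials that assigned it to each class.
    """
    n_trials = len(trial_labels)
    n_objects = len(trial_labels[0])
    probs = []
    for obj in range(n_objects):
        counts = Counter(trial[obj] for trial in trial_labels)
        probs.append([counts[c] / n_trials for c in range(n_classes)])
    return probs
```

With 200 trials, `probs[obj][c]` is the quantity plotted as P1, P2, P3 in Figure 3.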
Since we observed in the previous experiments that a large number of class 3 samples are incorrectly classified as class two, further labeled data for class 3 seems necessary. In experiment (c), we include another sample from class three, namely iris plant 102, in addition to plants 51 and 101. There are four errors in this experiment. We also perform a fourth experiment, experiment (d), with similar results. Table 2 shows the
Transductive Learning: Learning Iris Data with Two Labeled Data
235
Fig. 3. Probability of class membership after transductive learning, experiment (a) with normalized sampling; (a) P1: probability of class one, (b) P2: probability of class two, (c) P3: probability of class three

Table 2. Labeled samples used in transductive learning

    Exp.   class two   class three
    (a)    51          101
    (b)    52          102
    (c)    51          101, 102
    (d)    52          101, 102

Table 3. Number of errors after transductive learning on the iris dataset

    Exp.   samp.   norm. samp.
    (a)    16      14
    (b)    38      6
    (c)    8       4
    (d)    4       4
labeled samples used in the four experiments in transductive learning of the iris data. Table 3 shows the number of misclassifications in the four experiments on the iris dataset. We can see from the table that the errors of the normalized sampling approach are significantly lower than those of the unweighted approach. Moreover, the accuracy of learning with normalized sampling using two labeled samples reaches 96%, while the accuracy when using 3 labeled samples reaches 97%. Similar results are obtained using other labeled samples as training data. In general, we find that if the labeled samples represent typical features of their respective classes, the learning performance is acceptable. If a selected sample has a feature vector similar to samples from the other class, the learning performance is significantly lower.

Two transductive learning algorithms have been introduced that achieve learning through the stochastic combination of learning with labeled data and the clustering of unlabeled data. The transductive learning algorithms excel in situations where only a limited amount of labeled data is available. Comparing the weighted and unweighted sampling approaches to graph-based clustering, we find that the normalized sampling approach performs better than the unweighted sampling method.
References

1. A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.
2. S. Goldman and Y. Zhou. Enhancing supervised learning with unlabeled data. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
3. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 34(1), 1999.
4. T. S. Chiang and Y. Chow. Optimization approaches to semi-supervised learning. In M. C. Ferris, O. L. Mangasarian, and J. S. Pang, editors, Applications and Algorithms of Complementarity. Kluwer Academic Publishers, 2000.
5. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, second edition, 2000.
6. T. Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning (ICML), pages 200–209. Morgan Kaufmann, 1999.
7. Y. Wu and T. S. Huang. Color tracking by transductive learning. In Proceedings of the International Conference on Computer Vision and Image Processing, volume 1, pages 133–138, 2000.
8. P. Perona and W. T. Freeman. A factorization approach to grouping. In Proceedings of the European Conference on Computer Vision, pages 655–670, 1998.
9. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 21(8):888–905, 2000.
10. Y. Gdalyahu, D. Weinshall, and M. Werman. Stochastic image segmentation by typical cuts. In Proceedings of the IEEE CVPR 1999, volume 2, pages 596–601, 1999.
11. J. Puzicha, T. Hofmann, and J. M. Buhmann. A theory of proximity based clustering: structure detection by optimization. Pattern Recognition, 33:617–634, 2000.
12. D. R. Karger and C. Stein. A new approach to the minimum cut problem. Journal of the ACM, 43(4):601–640, 1996.
13. M. Nadler and E. P. Smith. Pattern Recognition Engineering. Wiley-Interscience, New York, 1993.
Approximation of Time-Varying Functions with Local Regression Models

Achim Lewandowski and Peter Protzel

Chemnitz University of Technology, Dept. of Electrical Engineering and Information Technology, Institute of Automation, 09107 Chemnitz, Germany
[email protected] [email protected]

Abstract. Industrial or robot control applications which have to cope with changing environments require adaptive models. The standard procedure of training a neural network offline, with no further learning during the actual operation of the network, is not sufficient in those cases. Therefore, we are concerned with developing algorithms for approximating time-varying functions. We assume that the data arrive sequentially, and we require an immediate update of the approximating function. The algorithm presented in this paper uses local linear regression models with adaptive kernel functions describing the validity region of each local model. While the method is developed to approximate a time-variant function, it can naturally also be used to improve the fit for a time-invariant function. An example is used to demonstrate the learning capabilities of the algorithm.
1 Introduction
Many industrial applications we are working on are characterized by highly nonlinear systems which also vary with time. Some systems change slowly, e.g. due to aging; other systems have very fast time-variances, e.g. due to sudden changes in the tools or materials used. Advanced control and prediction of these systems require accurate models, which leads to the problem of approximating nonlinear and time-variant functions. In contrast to time-series prediction, we need to predict not just the next output value at time t+1 but the complete functional relationship between outputs and inputs. Global models (e.g. multilayer perceptrons with sigmoid activation functions) generally suffer from the stability-plasticity dilemma, i.e. a model update with new training data in a certain region of the input space can also change the model in other regions of the input space, which is undesirable. In contrast, local models are only influenced by data points which fall into their validity regions. In recent years, several algorithms were developed which work with local regression models. The restriction of a model to be only a local learner can be implemented in an easy way by using weighted regression, whereby the weights are given by a kernel function. Schaal and Atkeson [1] introduced an algorithm which can adjust the number and the shape of the kernel functions to yield a better approximation in the case of a time-invariant function. Recently, another algorithm (Vijayakumar and Schaal [2]) was presented which can deal with some special cases of high-dimensional input data, again for time-invariant functions. While these algorithms are capable of learning online (i.e. pattern-by-pattern update), optimal results will be achieved only if the sample is repeatedly presented. We also developed an algorithm (Lewandowski et al. [3]) for the time-invariant case, based on competition between the kernel functions. The algorithm presented in this paper is a modification and extension of our approach with the goal of approximating time-varying functions. It should be noted that our algorithm and the others mentioned are not so-called "lazy learners" (Atkeson, Moore and Schaal [4]). For every request to forecast the output of a new input, the already constructed local models are used. In contrast, a lazy learner working with kernel functions would construct the necessary model(s) only at the moment when a new input pattern arrives.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 237–243, 2001.
© Springer-Verlag Berlin Heidelberg 2001

238 Achim Lewandowski and Peter Protzel
2 The Algorithm

2.1 Statistical Assumptions
We assume a standard regression model

    y = f(x, t) + ε ,                                              (1)

with x denoting the n-dimensional input vector (including the constant input 1) and y the output variable. The errors ε are assumed to be independently distributed with variance σ² and expectation zero. The unknown function f(x, t) is allowed to depend on the time t. Each local model i (i = 1, ..., m) is given by a linear model

    y = β_i^T x .                                                  (2)

The parameters for a given sample are estimated by weighted regression, whereby the weighting kernel for model i is given by

    w_i(x) = exp( −(1/2) (x − c_i)^T D_i (x − c_i) ) .             (3)

The matrix D_i describes the shape of the kernel function and c_i its location. Suppose now that all input vectors are collected row-by-row in a matrix X, the outputs y in a vector Y, and the weights for model i in a diagonal matrix W_i; then the parameter estimator is given by

    β̂_i = (X^T W_i X)^{-1} X^T W_i Y .                             (4)
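Equations (3) and (4) can be sketched directly in NumPy. This is a minimal batch fit under our own variable names, not the authors' code:

```python
import numpy as np

def kernel_weight(x, c, D):
    """Gaussian receptive field, Eq. (3): w_i(x) = exp(-(x-c)^T D (x-c) / 2)."""
    d = x - c
    return np.exp(-0.5 * d @ D @ d)

def fit_local_model(X, Y, c, D):
    """Weighted least squares, Eq. (4): beta = (X^T W X)^{-1} X^T W Y."""
    w = np.array([kernel_weight(x, c, D) for x in X])
    XtW = X.T * w  # scales each column of X^T by the corresponding weight
    return np.linalg.solve(XtW @ X, XtW @ Y)
```

With D = 0 every weight is 1 and the fit reduces to ordinary least squares, which is a convenient sanity check.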
With P_i = (X^T W_i X)^{-1}, a convenient recursion formula exists. After the arrival of a new data point (x, y), the following incremental update formulas hold according to Ljung and Söderström [5]:

    β̂_i^new = β̂_i + e_i w_i(x) P_i^new x ,                                     (5)

with

    P_i^new = (1/λ) [ P_i − ( P_i x x^T P_i ) / ( λ/w_i(x) + x^T P_i x ) ]      (6)

and

    e_i = y − β̂_i^T x .                                                        (7)
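One recursive update step, Eqs. (5)–(7), can be sketched as follows (our transcription; with λ = 1 and w = 1 this reduces to standard recursive least squares):

```python
import numpy as np

def rls_update(beta, P, x, y, w, lam):
    """One pattern update of the weighted recursive least squares estimator
    with forgetting factor lam (0 < lam <= 1); w = w_i(x) is the kernel
    activation of the local model at the new input x."""
    e = y - beta @ x                                                   # Eq. (7)
    P_new = (P - (P @ np.outer(x, x) @ P) / (lam / w + x @ P @ x)) / lam  # Eq. (6)
    beta_new = beta + e * w * (P_new @ x)                              # Eq. (5)
    return beta_new, P_new
```

Starting from a large P (a weak prior) and iterating over a noise-free linear data stream, the estimate converges to the true coefficients, matching the batch solution of Eq. (4).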
Note that a forgetting factor λ (0 < λ ≤ 1) is included. This forgetting factor is essential for our algorithm: it allows faster changes during the kernel update, as the influence of older observations must vanish if the approximating function is to follow the true model sufficiently fast.

2.2 Update of Receptive Fields
Assume that a new query point x arrives. Each model gives a prediction ŷ_i. These individual predictions are combined into a weighted mean according to the activations of the corresponding kernels:

    ŷ = Σ_i w_i(x) ŷ_i / Σ_i w_i(x) .                              (8)

We now describe how new kernels are inserted and old kernel functions are updated. The basic idea is that models which give better predictions for a certain area than their neighbors should gain space, while the worse models should retreat. This is done by changing the width and shape of the corresponding kernel functions. A new kernel function is inserted when no existing kernel function has an activation w_i(x) exceeding a user-defined threshold η > 0. As a side effect, the user can be warned that the prediction ŷ of the existing models must be judged carefully. The center c_i of the new kernel function is set to x and the matrix D_i is set to a predefined matrix D, usually a multiple of the identity matrix (D = kI). High values of k produce narrow kernels. If no new kernel function is needed, the neighboring kernel functions are updated. We restrict the update to kernel functions which exceed a second threshold δ = η². After the true output y is known, the absolute errors |e_i| of the predictions of all activated models are computed. If the model with the highest activation also has the lowest error, nothing happens. Likewise, if only one kernel function is sufficiently activated, the kernel shapes remain unchanged. Otherwise, if ē denotes the mean of these absolute errors, the matrix D_i of each model i included in the shape update is slightly changed. Define

    g_i = min( (|e_i| − ē) / ē , 1 ) ;                             (9)

then the shape matrix is updated by the following formula:

    D_i^new = D_i + ( (ρ^{g_i} − 1) / ( (x − c_i)^T D_i (x − c_i) ) ) D_i (x − c_i)(x − c_i)^T D_i .   (10)
The parameter ρ ≥ 1 controls the amount of change. The essential part of the activation, a_i = (x − c_i)^T D_i (x − c_i), is given after the update by

    a_i^new = ρ^{g_i} a_i .                                        (11)

The activation of kernel functions belonging to models which perform better than the average will therefore rise for this particular x. The update of D_i is constructed in such a way that activations for other input values z which fulfil

    (z − c_i)^T D_i (x − c_i) = 0                                  (12)

are not changed. Inputs z for which the vector z − c_i is, in a certain sense, perpendicular to x − c_i receive the same activation as before. After the kernel update, all linear models are updated as described in the last section. For a model with very low activation (w_i(x) ≈ 0) the parameter estimator, computed as in (5), will not change, but according to equation (6) P_i will grow by the factor 1/λ, so that the corresponding model becomes more and more sensitive if it has not seen data for a long time. We could further use a function λ = λ(x), expressing our prior belief about variability.
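The claimed effect of Eq. (10) — the activation term scales by ρ^{g_i} as in Eq. (11), while directions satisfying Eq. (12) are untouched — can be checked numerically. A sketch under our transcription of the update:

```python
import numpy as np

def update_shape(D, c, x, rho, g):
    """Kernel shape update, Eq. (10): rank-one correction of D that scales
    the activation term a = (x-c)^T D (x-c) by rho**g while leaving inputs
    that fulfil Eq. (12) unchanged."""
    d = x - c
    Dd = D @ d
    a = d @ Dd
    return D + (rho**g - 1.0) / a * np.outer(Dd, Dd)
```

Because the correction is a multiple of D(x−c)(x−c)^T D, any z with (z−c)^T D (x−c) = 0 annihilates it, which is exactly the invariance stated above.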
3 Empirical Results
Vijayakumar and Schaal [2] and Schaal and Atkeson [1] used the following time-invariant function to demonstrate the capabilities of their algorithms:

    z = max( e^{−10x²}, e^{−50y²}, 1.25 e^{−5(x²+y²)} ) + N(0, 0.01) .     (13)

In their example, a sample of 500 points, uniformly drawn from the unit square, was used to fit a model. The whole sample was repeatedly presented pattern-by-pattern, whereby from epoch to epoch the sample was randomly shuffled. As a test set, a 41 × 41 grid over the unit square was used, with the true outputs without noise. For our purposes we modify this approach. Each data point is presented just once. The first 3000 data points are generated according to (13). From t = 3001 to t = 7000 the true function changes gradually according to

    z(t) = max( ((7000 − t)/4000) e^{−10x²}, e^{−50y²}, 1.25 e^{−5(x²+y²)} ) + N(0, 0.01) .   (14)

From t = 7001 to t = 10000 the function is time-invariant again:

    z(t) = max( e^{−50y²}, 1.25 e^{−5(x²+y²)} ) + N(0, 0.01) .             (15)

We set D = 120I, λ = 0.999, η = 0.10 and ρ = 1.05. Figure 1 shows the true function (left) and our approximation (right) for t = 3000 (top), t = 5000 (middle) and t = 10000 (bottom). The approximation quality is good, even for t = 5000, where the function is in a changing state.
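The data stream of Eqs. (13)–(15) can be generated as follows (our transcription; note that N(0, 0.01) denotes noise with variance 0.01, i.e. standard deviation 0.1):

```python
import numpy as np

def target(x, y, t, rng):
    """Time-varying benchmark of Eqs. (13)-(15): the e^{-10x^2} ridge is
    faded out linearly between t = 3001 and t = 7000."""
    if t <= 3000:
        scale = 1.0                      # Eq. (13)
    elif t <= 7000:
        scale = (7000 - t) / 4000.0      # Eq. (14)
    else:
        scale = 0.0                      # Eq. (15)
    z = max(scale * np.exp(-10 * x**2),
            np.exp(-50 * y**2),
            1.25 * np.exp(-5 * (x**2 + y**2)))
    return z + rng.normal(0.0, 0.1)      # N(0, 0.01): std dev 0.1

# typical usage: stream 10000 points uniformly drawn from the unit square
# rng = np.random.default_rng(0)
# x, y = rng.uniform(-1, 1, size=2); z = target(x, y, t, rng)
```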
Fig. 1. True function (left) and approximation (right) for t = 3000 (top), t = 5000 (middle) and t = 10000 (bottom)
We started a second run with ρ set to 1, so that no kernel shape update was performed. The same 10000 data points were used. Figure 2 shows the development of the mean squared error for ρ = 1.05 (solid) and ρ = 1.00 (dashed). With kernel shape updating, the error on the test set after processing all 10000 data points was less than half that of the same algorithm with fixed kernel shapes. Figure 3 shows a contour plot (w_i(x) ≡ 0.7) of the activation functions of our local models after all data points have been presented (ρ = 1.05). It is
Fig. 2. Mean squared error when updating kernel shapes (ρ = 1.05) and without update (ρ = 1.00)
Fig. 3. Contour plot for t = 10000 (w_i(x) ≡ 0.7, ρ = 1.05)
clearly visible that the kernel functions have adjusted their shapes according to the appearance of our example function. As the function is time-invariant for the first 3000 points, we can compare our results with those given in [1] and [2]. After presenting 500 data points sequentially and only once, our mean squared error on the test set was 0.0067, which corresponds to a normalized error, as defined by Vijayakumar and Schaal [2], of 0.0473. This is half the value they reached after showing the 500 points twice, and about half the error we achieved in [3] under the same conditions. This quick reduction of the error while presenting each data point only once is a prerequisite for the time-variant function approximation, which is our actual goal.
4 Conclusions

The increasing need in industrial applications to deal with nonlinear and time-variant systems is the motivation for developing algorithms that approximate nonlinear functions which vary with time. We presented an algorithm that works with local regression models and adaptive kernel functions based on competition. There is no need to store any data points, which we consider an advantage. The results visualized for an example show quick adaptation and good convergence of the model after presenting each data point from the time-varying function sequentially and only once. For future work, it will be necessary to develop self-adjusting parameters. In our algorithm, the choice of the initialization matrix D in combination with the threshold value η is crucial for the performance and generalization capabilities. We are currently working on an approach to optimize these parameters online. An even more challenging problem we are working on is to anticipate changes of the true function instead of just following it.
References

1. Schaal, S., Atkeson, C.: Constructive incremental learning from only local information. Neural Computation, 10(8) (1998) 2047–2084
2. Vijayakumar, S., Schaal, S.: Locally weighted projection regression. Technical Report (2000)
3. Lewandowski, A., Tagscherer, M., Kindermann, L., Protzel, P.: Improving the fit of locally weighted regression models. Proceedings of the 6th International Conference on Neural Information Processing, Vol. 1 (1999) 371–374
4. Atkeson, C., Moore, A., Schaal, S.: Locally weighted learning. Artificial Intelligence Review, 11(4) (1997) 76–113
5. Ljung, L., Söderström, T.: Theory and Practice of Recursive Identification. MIT Press, Cambridge (1986)
Theory
Complexity of Learning for Networks of Spiking Neurons with Nonlinear Synaptic Interactions

Michael Schmitt

Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, D–44780 Bochum, Germany
http://www.ruhrunibochum.de/lmi/mschmitt/
[email protected] Abstract. We study model networks of spiking neurons where synaptic inputs interact in terms of nonlinear functions. These nonlinearities are used to represent the spatial grouping of synapses on the dendrites and to model the computations performed at local branches. We analyze the complexity of learning in these networks in terms of the VC dimension and the pseudo dimension. Polynomial upper bounds on these dimensions are derived for various types of synaptic nonlinearities.
1 Introduction
In many paradigms of neural computation the interaction of synaptic inputs is modeled in terms of linear operations. This is evident for standard neuron models such as the threshold, the sigmoidal, and the linear unit, but it also holds, in most cases, for the biologically closer leaky integrate-and-fire neurons and the spiking neuron models. Linearity is known to be sufficient for capturing the passive properties of the dendritic membrane, where synaptic inputs occur in the form of currents that are combined by a summing operation. In neurobiology, however, it is well established that synaptic inputs can interact nonlinearly, for instance when the synapses are co-localized on patches of dendritic membrane with specific properties (see, e.g., Koch [3]). In contrast to linear interactions, nonlinearities can reflect the spatial grouping of synapses (the "synaptic clustering") on the dendritic tree and the computations performed at local branches. In this paper we study models for spiking neurons that use nonlinearities to combine synaptic inputs. We analyze the complexity of learning in networks of these model neurons in terms of the Vapnik-Chervonenkis (VC) dimension and the pseudo dimension. We establish bounds on these dimensions that can be used to derive results on the sample complexity and generalization capability of these networks. (For the relationship of the VC dimension and the pseudo dimension to the complexity of learning see, e.g., Anthony and Bartlett [1].)
Work supported by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150. A longer (8page) version of this paper is available from http://www.ruhrunibochum.de/lmi/mschmitt/.
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 247–252, 2001.
© Springer-Verlag Berlin Heidelberg 2001
248 Michael Schmitt

2 Spiking Neurons with Nonlinear Synaptic Interactions
In a network of spiking neurons each node v receives inputs in the form of spikes through its synaptic connections from other nodes. Assume that v has n input connections, each of them characterized by a synaptic weight w_i ∈ IR and a transmission delay d_i ∈ IR+, where IR+ = {x ∈ IR : x ≥ 0}. If the neuron presynaptic to connection i emits a spike at time t_i ∈ IR+, this generates in v a pulse described by the function t → h_i(t − t_i), where h_i : IR+ → IR is defined as

    h_i(t) = w_i  if d_i ≤ t < d_i + 1 ,
    h_i(t) = 0    otherwise .

Hence the pulse has unit duration and its height corresponds to the strength, or efficacy, of the synapse. We consider d_i and w_i, for i = 1, ..., n, as adjustable parameters of the neuron. A synaptic cluster of v is a subset I ⊆ {1, ..., n} of its synapses spatially grouped together on the dendritic tree. They interact nonlinearly, where the nonlinearity is specified for each subset {i_1, ..., i_l} ⊆ I by a rational function f_{i_1,...,i_l} in l variables. We call f_{i_1,...,i_l} the synaptic interaction of {i_1, ..., i_l}. A synapse may participate in more than one cluster. Let F ⊆ {1, ..., n} be the set of synapses that receive a spike during the computation of v and define J_I : IR+ → 2^I as

    J_I(t) = { i ∈ I ∩ F : h_i(t − t_i) ≠ 0 } ,

that is, J_I(t) is the set of synapses in cluster I that are simultaneously active at time t. The interaction of the synapses in I is then modeled by the function M_I : IR+ → IR with

    M_I(t) = f_{i_1,...,i_l}(w_{i_1}, ..., w_{i_l}),  where {i_1, ..., i_l} = J_I(t) .

Assume that v has k synaptic clusters I_1, ..., I_k. They interact linearly to yield the membrane potential P : IR+ → IR defined by

    P(t) = Σ_{j=1}^{k} M_{I_j}(t) .

Neuron v fires when the membrane potential reaches the threshold θ, that is, the firing time t_v of v is defined by t_v = min{t : P(t) ≥ θ}. If t_v is undefined then v does not fire. The networks of spiking neurons that we study here use temporal coding for computations, that is, they encode information in the timing of single firing events. Precisely, if neuron v fires at time t_v, this is assumed to encode the real number t_v. We consider feedforward networks with a given number of input nodes and one output node. Input vectors enter the network in terms of firing times of the input nodes. The output value of the network is the firing time of the output node. If the output node does not fire, the network output is defined
to be 0. The depth of the network is the length of the longest path from an input node to the output node. The model introduced here generalizes that of [5] by allowing nonlinear interactions of synaptic inputs in terms of clusters that represent the spatial grouping of synapses. Models of spiking neurons with linear interactions are discussed, e.g., in [2,4]. We have included synaptic nonlinearities to take into account the well-established neurobiological finding that the interaction of synapses co-localized on the dendritic tree is essentially nonlinear (see [3, Chapter 5]). The nonlinearities we employ here are based on arithmetic operations that can easily be implemented in analog VLSI circuits and thus be used for hardware designs of pulsed neural networks such as those described in [6]. In this paper we calculate bounds on the VC dimension and the pseudo dimension of the networks introduced above. These dimensions are defined as follows (see also [1]): A class F of {0, 1}-valued functions in n variables is said to shatter a set of vectors S ⊆ IR^n if F induces all dichotomies on S (that is, if for every partition of S into two disjoint subsets (S_0, S_1) there is some f ∈ F satisfying f(S_0) ⊆ {0} and f(S_1) ⊆ {1}). The Vapnik-Chervonenkis (VC) dimension of F is the cardinality of the largest set shattered by F. If F is a class of real-valued functions in n variables, the pseudo dimension of F is defined as the VC dimension of the class {g : IR^{n+1} → {0, 1} | ∃f ∈ F ∀x ∈ IR^n ∀y ∈ IR : g(x, y) = sgn(f(x) − y)} (where sgn(z) = 1 if z ≥ 0, and 0 otherwise). The VC dimension (pseudo dimension) of a network of spiking neurons is defined to be the VC dimension (pseudo dimension) of the set of functions computed by the network under all possible assignments of values to its adjustable parameters, i.e. its delays, weights, and thresholds.
To define the VC dimension for networks with real-valued output we assume that the output values are mapped to {0, 1} by some (nontrivial) threshold. Clearly, the VC dimension of a network is not larger than its pseudo dimension.
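As an illustration of the computational model above, the firing time t_v = min{t : P(t) ≥ θ} can be computed by scanning pulse breakpoints, since with unit-duration rectangular pulses P(t) is piecewise constant. A sketch with a hypothetical data layout; a simple additive interaction stands in here for the rational f_{i_1,...,i_l}:

```python
def firing_time(spikes, delays, clusters, theta):
    """Earliest t with P(t) >= theta for piecewise-constant unit-duration
    pulses.  spikes[i] is the input spike time t_i (or None if synapse i
    receives no spike), delays[i] the delay d_i.  clusters: list of
    (synapse_indices, interaction_fn) pairs, where interaction_fn maps the
    tuple of simultaneously active synapses to M_I(t).  Returns None if the
    threshold is never reached."""
    starts = [spikes[i] + delays[i] for i in range(len(spikes))
              if spikes[i] is not None]
    breakpoints = sorted(set(starts + [s + 1.0 for s in starts]))
    for t in breakpoints:  # P(t) can only change at pulse starts/ends
        P = 0.0
        for synapses, f in clusters:
            active = tuple(i for i in synapses
                           if spikes[i] is not None
                           and spikes[i] + delays[i] <= t < spikes[i] + delays[i] + 1.0)
            P += f(active)
        if P >= theta:
            return t
    return None
```

With two synapses of weight 0.6 whose pulses overlap on [0.5, 1), the potential first reaches a threshold of 1.0 at t = 0.5, the start of the overlap.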
3 Bounded Depth Networks
In the following we provide an upper bound on the pseudo dimension for networks of spiking neurons that is almost linear in the number of network parameters (delays, weights, and thresholds) and the network depth.

Theorem 1. Suppose N is a network of spiking neurons with W parameters and depth D where each neuron has at most k synaptic clusters and p is a bound on the degree of the synaptic interactions. Then the pseudo dimension of N is at most O(W D log(W Dkp)).

The main idea of the proof is first to derive from N and a given set of input vectors a set of rational functions in the parameter variables of N. Next, the connected components into which the parameter domain is partitioned by the zero sets of these rational functions are considered. The construction guarantees that the number of dichotomies that N induces on the set of input vectors is not larger than the number of connected components generated by these functions.
Thus a bound on the number of connected components also limits the number of dichotomies. We require Definitions 7.4 and 7.5 from [1]: A set {f_1, ..., f_k} of differentiable real-valued functions on IR^d is said to have regular zero-set intersections if for every nonempty set {i_1, ..., i_l} ⊆ {1, ..., k} the Jacobian (that is, the matrix of the partial derivatives) of (f_{i_1}, ..., f_{i_l}) : IR^d → IR^l has rank l at every point of the set {a ∈ IR^d : f_{i_1}(a) = · · · = f_{i_l}(a) = 0}. A class G of real-valued functions defined on IR^d has solution set components bound B if for every k ∈ {1, ..., d} and every {f_1, ..., f_k} ⊆ G that has regular zero-set intersections, the number of connected components of the set {a ∈ IR^d : f_1(a) = · · · = f_k(a) = 0} is at most B. Further, a class F of real-valued functions is said to be closed under addition of constants if for every c ∈ IR and f ∈ F the function z → f(z) + c is also in F. The following lemma gives a stronger formulation of a bound stated in Theorem 7.6 of [1].

Lemma 2. Let F be a class of real-valued functions (y_1, ..., y_d, x_1, ..., x_n) → f(y_1, ..., y_d, x_1, ..., x_n) that is closed under addition of constants and where each function in F is C^d in the variables y_1, ..., y_d. If the class G = {(y_1, ..., y_d) → f(y_1, ..., y_d, s) : f ∈ F, s ∈ IR^n} has solution set components bound B, then for any sets {f_1, ..., f_k} ⊆ F and {s_1, ..., s_m} ⊆ IR^n, where m ≥ d/k, the function T : IR^d → {0, 1}^{mk} defined by

    T(a) = (sgn(f_1(a, s_1)), ..., sgn(f_1(a, s_m)), ..., sgn(f_k(a, s_1)), ..., sgn(f_k(a, s_m)))

partitions IR^d into at most B Σ_{i=0}^{d} C(mk, i) ≤ B(emk/d)^d equivalence classes (where a_1, a_2 ∈ IR^d are equivalent iff T(a_1) = T(a_2)).
The main task is to bound the number of dichotomies induced on S = {(s1 , u1 ), . . . , (sm , um )} by functions of the form (x, y) →sgn(f (x) − y) where f is computed by N . We proceed by induction on the levels of the network nodes where the level of a node v is deﬁned as the length of the longest path from an input node to v. As induction hypothesis we assume that Gλ−1 , λ ≥ 1, is a set of functions in the network parameters for the nodes of level at most λ − 1 that partitions the parameter domain into equivalence classes such that for parameters from the same class the computations of these nodes do not change on inputs from S in the following sense: The sequence of subsets of simultaneously active synapses remains the same for each node and the ﬁring of a node is triggered by the same set of (starting or ending) synaptic pulses. Performing induction we ﬁnally get the bound 2D (2em)W (16e2 m2 W kp)W D for the number of equivalence classes. If S is shattered this implies that m ≤ D + (W + 2W D) log m + W log(2e) + W D log(16e2 W kp) . We conclude that m = O(W D log(W Dkp)).
If we restrict the type of functions used to model the synaptic interactions to the class of polynomials, we can derive a bound that improves the previous one in that it does not depend on the number of synaptic clusters.

Theorem 3. Suppose N is a network of spiking neurons with W parameters and depth D where the synaptic interactions are polynomials of degree at most p. Then the pseudo dimension of N is at most O(W D log(W Dp)).

Proof. (Sketch) The proof follows the reasoning of the previous one. The only place where the number of clusters was taken into account was in the solution set components bound. For the class of degree-p polynomials in d variables the solution set components bound 2(2p)^d can be used instead. This leads to the claimed result.

If the depth and the degree of the synaptic interactions are fixed, the previous theorem yields a tight bound. The same holds for rational interactions if the number of synaptic clusters in each node is fixed. We further note that the following almost linear upper bound improves the quadratic bound for networks with linear interactions given in [5].

Corollary 4. Consider the class of networks of spiking neurons with fixed depth and with polynomial synaptic interactions of fixed degree. Suppose N is a network from this class with W parameters. Then the pseudo dimension of N is Θ(W log W). This bound is also valid for the class of networks with rational synaptic interactions where the depth, the degree of the synaptic interactions, and the number of synaptic clusters in each neuron are fixed.

Proof. The upper bound is due to Theorem 1; the lower bound is due to Theorem 2.2 in [5], where it was shown that a single spiking neuron with W delay parameters has VC dimension Ω(W log W).
4 Networks with Arbitrary Depth
Clearly, we obtain from Theorem 1 a bound not depending on the depth of the network, using the fact that the depth is not larger than the number of parameters. A direct method yields a better bound. Its proof is omitted here.

Theorem 5. Let $\mathcal{N}$ be a network of spiking neurons with $W$ parameters and rational synaptic interactions of degree at most $p$ where each neuron has at most $k$ synaptic clusters. Then the pseudo dimension of $\mathcal{N}$ is at most $O(W^2 \log(kp))$. For networks with polynomial interactions the bound $O(W^2 \log p)$ holds.

This leads to the following result, analogous to Corollary 4. The lower bound is due to Theorem 3.1 in [5].

Corollary 6. Consider the class of networks of spiking neurons with polynomial synaptic interactions of fixed degree. Suppose $\mathcal{N}$ is a network from this class with $W$ parameters. Then the pseudo dimension of $\mathcal{N}$ is $\Theta(W^2)$. This bound is also valid for the class of networks with rational synaptic interactions where the degree of these interactions and the number of synaptic clusters in each neuron are fixed.
Michael Schmitt

5 Networks with Unrestricted Order
The networks considered thus far require restricted degrees of synaptic interactions. We next establish a polynomial pseudo dimension bound even for unlimited degree. Let the interaction of synaptic cluster $\{i_1, \ldots, i_l\}$ be modeled by the function $f^{(q_1, \ldots, q_l)}_{\{i_1, \ldots, i_l\}}$ in $l$ variables with parameters $q_1, \ldots, q_l \in \mathbb{R}$, where

$f^{(q_1, \ldots, q_l)}_{\{i_1, \ldots, i_l\}}(w_1, \ldots, w_l) = w_1^{q_1} w_2^{q_2} \cdots w_l^{q_l} .$

(We require that $q_i$ be an integer if $w_i < 0$, to keep the network working in the real domain.) A bound on the pseudo dimension is given by the following result, based on new solution set components bounds for function classes defined in terms of polynomials, exponentials, and logarithms, which we omit here.

Theorem 7. Suppose $\mathcal{N}$ is a network of spiking neurons with $W$ parameters and nonlinear synaptic interactions of the form $f^{(q_1, \ldots, q_l)}_{\{i_1, \ldots, i_l\}}$ where each neuron has at most $k$ synaptic clusters. Then $\mathcal{N}$ has pseudo dimension at most $O((Wk)^4)$.
6 Conclusions
We have derived upper bounds on the pseudo dimension of networks of spiking neurons allowing for nonlinear synaptic interactions. All bounds are polynomials of low degree. The pseudo dimension is well known to be an upper bound for the so-called fat-shattering dimension. Both dimensions give bounds on the so-called covering numbers. The latter can be employed to obtain estimates for the number of training examples required for the networks to have low generalization error. The results presented here show that even when the computational power of networks of spiking neurons increases considerably due to nonlinear interactions, the sample complexity remains close to that of linear interactions. Not all bounds were shown to be tight, however, and future work could lead to improvements. It might also be possible to obtain better estimates for the sample complexity by deriving bounds on the fat-shattering dimension directly.
References

1. M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.
2. W. Gerstner. Spiking neurons. In W. Maass and C. M. Bishop, editors, Pulsed Neural Networks, chapter 1, pages 3–53. MIT Press, Cambridge, Mass., 1999.
3. C. Koch. Biophysics of Computation. Oxford University Press, New York, 1999.
4. W. Maass. Computing with spiking neurons. In W. Maass and C. M. Bishop, editors, Pulsed Neural Networks, chapter 2, pages 55–85. MIT Press, Cambridge, Mass., 1999.
5. W. Maass and M. Schmitt. On the complexity of learning for spiking neurons with temporal coding. Information and Computation, 153:26–46, 1999.
6. A. F. Murray. Pulse-based computation in VLSI neural networks. In W. Maass and C. M. Bishop, editors, Pulsed Neural Networks, chapter 3, pages 87–109. MIT Press, Cambridge, Mass., 1999.
Product Unit Neural Networks with Constant Depth and Superlinear VC Dimension

Michael Schmitt

Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, D–44780 Bochum, Germany
http://www.ruhr-uni-bochum.de/lmi/mschmitt/
[email protected]

Abstract. It has remained an open question whether there exist product unit networks with constant depth that have superlinear VC dimension. In this paper we give an answer by constructing two-hidden-layer networks with this property. We further show that the pseudo dimension of a single product unit is linear. These results bear witness to the cooperative effects on the computational capabilities of product unit networks as they are used in practice.
1 Introduction
Product units are formal neurons that differ from the most widely used neuron types in that they multiply their inputs instead of summing them. Furthermore, their weights operate as exponents and not as factors. Product units were introduced by Durbin and Rumelhart [2] to allow neural networks to learn multiplicative interactions of arbitrary degree. Product unit networks have been proven computationally more powerful than sigmoidal networks in many learning applications by showing that they solve problems using fewer units than networks with summing units. Furthermore, the success of product unit networks has manifested itself in a rich collection of learning algorithms ranging from local ones, like gradient descent, to more global ones, such as simulated annealing and genetic algorithms [3,5]. In this paper we investigate the Vapnik-Chervonenkis (VC) dimension of product unit networks. The VC dimension, and the related pseudo dimension, is well established as a measure for the computational diversity of analog computing mechanisms. It is further known to yield bounds for the complexity of learning in various well-studied models such as agnostic [1] and online learning [7]. For a large class of neuron types a superlinear VC dimension has been established for networks, while being linear for single units [4,6,8]. This fact gives evidence that the computational power of these neurons is considerably
Work supported by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150. A longer (9-page) version of this paper is available from http://www.ruhr-uni-bochum.de/lmi/mschmitt/. Complete proofs also appear in [9].
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 253–258, 2001. © Springer-Verlag Berlin Heidelberg 2001
enhanced when they cooperate in networks. Using methods due to Koiran and Sontag [4], who considered sigmoidal networks, it is not hard to show that there exist product unit networks with superlinear VC dimension. The drawback of this result is, however, that it requires the networks to have unbounded depth. Such architectures are rarely used in practical applications, where one or two hidden layers are almost always found to be satisfactory. We show here that constant-depth product unit networks can indeed have superlinear VC dimension. In particular, we construct networks with two hidden layers of product and linear units that have VC dimension $\Omega(W \log k)$, where $W$ is the number of weights and $k$ the number of network nodes. The result is obtained by first establishing a superlinear lower bound for one-hidden-layer networks of a new sigmoidal-type summing unit. We contrast these lower bounds by showing that the pseudo dimension, and hence the VC dimension, of a single product unit is no more than linear.
2 Product Unit Networks and VC Dimension
A product unit has the form

$x_1^{w_1} x_2^{w_2} \cdots x_p^{w_p}$
with input variables $x_1, \ldots, x_p$ and real weights $w_1, \ldots, w_p$. The weights are the adjustable parameters of the product unit. (If $x_i < 0$, we require that $w_i$ be an integer in order for the outputs to be real-valued.) In a monomial the exponents are fixed positive integers. Thus, a product unit is computationally at least as powerful as any monomial. Moreover, divisive operations can be expressed using negative weights. What makes product units furthermore attractive is that the exponents are suitable for automatic adjustment by gradient-based and other learning methods. We consider feedforward networks with a given number of input nodes and one output node. A network consisting solely of product units is equivalent to a single product unit. Therefore, product units are mainly used in networks where they occur together with other types of units. Here each non-input node may be a product or a linear unit (computing an affine combination of its inputs). We require that the networks have constant depth, where the depth is the length of the longest path from an input node to the output node. VC dimension and pseudo dimension are defined as follows (see also [1]): A class $F$ of $\{0,1\}$-valued functions in $n$ variables is said to shatter a set of vectors $S \subseteq \mathbb{R}^n$ if $F$ induces all dichotomies on $S$ (that is, if for every partition of $S$ into two disjoint subsets $(S_0, S_1)$ there is some $f \in F$ satisfying $f(S_0) \subseteq \{0\}$ and $f(S_1) \subseteq \{1\}$). The Vapnik-Chervonenkis (VC) dimension of $F$ is the cardinality of the largest set shattered by $F$. If $F$ is a class of real-valued functions in $n$ variables, the pseudo dimension of $F$ is defined as the VC dimension of the class $\{g : \mathbb{R}^{n+1} \to \{0,1\} \mid \exists f \in F\ \forall x \in \mathbb{R}^n\ \forall y \in \mathbb{R} : g(x, y) = \operatorname{sgn}(f(x) - y)\}$ (where $\operatorname{sgn}(z) = 1$ if $z \ge 0$, and 0 otherwise). The VC dimension (pseudo dimension) of a network is defined to be the VC dimension (pseudo dimension) of the set of
functions computed by the network with all possible assignments of values to its adjustable parameters. For networks with real-valued output we assume that the output values are mapped to $\{0,1\}$ using some (nontrivial) threshold. Clearly, the pseudo dimension of a network is not smaller than its VC dimension.
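To make the shattering definition concrete, here is a small brute-force check (our own toy sketch, not part of the paper: the points, the weight grid, and the output threshold 1 are invented for the illustration). It verifies that a single two-input product unit induces all four dichotomies of the two points (2, 1) and (1, 2):

```python
import itertools
import math

def product_unit(x, w):
    """Output of a product unit: x_1**w_1 * ... * x_p**w_p (positive inputs)."""
    return math.prod(xi ** wi for xi, wi in zip(x, w))

def shatters(points, weight_grid, threshold=1.0):
    """Brute-force test: does the product unit, with weights ranging over
    weight_grid and output thresholded at `threshold`, induce every
    dichotomy of `points`?"""
    achieved = {
        tuple(int(product_unit(x, w) >= threshold) for x in points)
        for w in weight_grid
    }
    return len(achieved) == 2 ** len(points)

grid = list(itertools.product([-1, 0, 1], repeat=2))
print(shatters([(2.0, 1.0), (1.0, 2.0)], grid))  # prints True
```

For instance, weights (1, −1) send (2, 1) above the threshold and (1, 2) below it, while (0, 0) and (−1, −1) realize the two constant labelings.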
3 Superlinear Lower Bound for Product Unit Networks
We show in this section that product unit networks of constant depth can have a superlinear VC dimension. The following result is the major step in establishing this bound. We note that hidden layers are numbered here from the input nodes toward the output node.

Theorem 1. Let $n, k$ be natural numbers satisfying $k \le 2^{n+2}$. There is a network $\mathcal{N}$ with the following properties: It has $n$ input nodes, at most $k$ hidden nodes arranged in two layers with product units in the first hidden layer and linear units in the second, and a product unit as output node; furthermore, $\mathcal{N}$ has $2nk/4$ adjustable and $7k/4$ fixed weights. The VC dimension of $\mathcal{N}$ is at least $(n - \log(k/4)) \cdot (k/8) \cdot \log(k/8)$.

For the proof we require the following definition: A set of $m$ vectors in $\mathbb{R}^n$ is said to be in general position if every subset of at most $n$ vectors is linearly independent. Obviously, sets in general position exist for any $m$ and $n$. We also introduce a new type of summing unit that computes its output by applying the activation function $\tau(y) = 1 + 1/\cosh(y)$ to the weighted sum of its inputs. (The new unit can be considered as the standard sigmoidal unit with $\sigma$ replaced by $\tau$.) We observe that $\tau$ has its maximum at $y = 0$ with $\tau(0) = 2$ and satisfies $\lim_{y \to -\infty} \tau(y) = \lim_{y \to \infty} \tau(y) = 1$.

Lemma 2. Let $h, m, r$ be arbitrary natural numbers. Suppose $\mathcal{N}$ is a network with $m + r$ input nodes, one hidden layer of $h + 2^r$ nodes which are summing units with activation function $1 + 1/\cosh$, and a monomial as output node. Then there is a set of cardinality $h \cdot m \cdot r$ that is shattered by $\mathcal{N}$.

Proof. We choose a set $\{s_1, \ldots, s_{h \cdot m}\} \subseteq \mathbb{R}^m$ in general position and let $e_1, \ldots, e_r$ be the unit vectors in $\mathbb{R}^r$, that is, they have a 1 in exactly one component and 0 elsewhere. Clearly then, the set $S = \{s_i : i = 1, \ldots, h \cdot m\} \times \{e_j : j = 1, \ldots, r\}$ is a subset of $\mathbb{R}^{m+r}$ with cardinality $h \cdot m \cdot r$. We show that it can be shattered by the network $\mathcal{N}$ as claimed. Assume that $(S_0, S_1)$ is a dichotomy of $S$.
Let $L_1, \ldots, L_{2^r}$ be an enumeration of all subsets of the set $\{1, \ldots, r\}$ and define the function $g : \{1, \ldots, h \cdot m\} \to \{1, \ldots, 2^r\}$ to satisfy $L_{g(i)} = \{j : s_i e_j \in S_1\}$,
where $s_i e_j$ denotes the concatenation of the vectors $s_i$ and $e_j$. For $l = 1, \ldots, 2^r$ let $R_l \subseteq \{s_1, \ldots, s_{h \cdot m}\}$ be the set $R_l = \{s_i : g(i) = l\}$. For each $R_l$ we use $\lceil |R_l|/m \rceil$ hidden nodes, for which we define the weights as follows: We partition $R_l$ into $\lceil |R_l|/m \rceil$ subsets $R_{l,p}$, $p = 1, \ldots, \lceil |R_l|/m \rceil$, each of which has cardinality $m$, except for possibly one set of cardinality less than $m$. For each subset $R_{l,p}$ there exist real numbers $w_{l,p,1}, \ldots, w_{l,p,m}, t_{l,p}$ such that every $s_i \in \{s_1, \ldots, s_{h \cdot m}\}$ satisfies

$(w_{l,p,1}, \ldots, w_{l,p,m}) \cdot s_i - t_{l,p} = 0 \quad \text{if and only if} \quad s_i \in R_{l,p}. \qquad (1)$
This follows from the fact that the set $\{s_1, \ldots, s_{h \cdot m}\}$ is in general position. (In other words, $(w_{l,p,1}, \ldots, w_{l,p,m}, t_{l,p})$ represents the hyperplane passing through all points in $R_{l,p}$ and through none of the other points.) With subset $R_{l,p}$ we associate a hidden node with threshold $t_{l,p}$ and with weights $w_{l,p,1}, \ldots, w_{l,p,m}$ for the connections from the first $m$ input nodes. Since among the subsets $R_{l,p}$ at most $h$ have cardinality $m$ and at most $2^r$ have cardinality less than $m$, this construction can be done with at most $h + 2^r$ hidden nodes. Thus far, we have specified the weights for the connections outgoing from the first $m$ input nodes. The connections from the remaining $r$ input nodes are weighted as follows: Let $\varepsilon > 0$ be a real number such that for every $s_i \in \{s_1, \ldots, s_{h \cdot m}\}$ and every weight vector $(w_{l,p,1}, \ldots, w_{l,p,m}, t_{l,p})$: if $s_i \notin R_{l,p}$ then $|(w_{l,p,1}, \ldots, w_{l,p,m}) \cdot s_i - t_{l,p}| > \varepsilon$. According to the construction of the weight vectors in (1), such an $\varepsilon$ clearly exists. We define the remaining weights $w_{l,p,m+1}, \ldots, w_{l,p,m+r}$ by

$w_{l,p,m+j} = \begin{cases} 0 & \text{if } j \in L_l, \\ \varepsilon & \text{otherwise.} \end{cases} \qquad (2)$

We show that the hidden nodes thus defined satisfy the following:

Claim 3. If $s_i e_j \in S_1$ then there is exactly one hidden node with output value 2; if $s_i e_j \in S_0$ then all hidden nodes yield an output value less than 2.

According to (1) there is exactly one weight vector $(w_{l,p,1}, \ldots, w_{l,p,m}, t_{l,p})$, where $l = g(i)$, that yields 0 on $s_i$. If $s_i e_j \in S_1$ then $j \in L_{g(i)}$, which together with (2) implies that the weighted sum $(w_{l,p,m+1}, \ldots, w_{l,p,m+r}) \cdot e_j$ is equal to 0. Hence, this node gets the total weighted sum 0 and, applying $1 + 1/\cosh$, outputs 2. The input vector $e_j$ changes the weighted sums of the other nodes by an amount of at most $\varepsilon$. Thus, the total weighted sums for these nodes remain different from 0 and, hence, the output values are less than 2.
On the other hand, if $s_i e_j \in S_0$ then $j \notin L_{g(i)}$ and the node that yields 0 on $s_i$ receives an additional amount $\varepsilon$ through weight $w_{l,p,m+j}$. This gives a total weighted sum different from 0 and an output value less than 2. All other nodes
miss the total weighted sum 0 by a margin of more than $\varepsilon$ and thus have total weighted sum different from 0 and, hence, an output value less than 2. Thus Claim 3 is proven. Finally, to complete the proof, we make one more modification to the weight vectors and define the weights for the output node. Clearly, if we multiply all weights and thresholds defined thus far by any real number $\alpha > 0$, Claim 3 remains true. Since $\lim_{y \to \pm\infty}(1 + 1/\cosh(y)) = 1$, we can find an $\alpha$ such that on every $s_i e_j \in S$ the output values of those hidden nodes that do not output 2, multiplied together, yield a value as close to 1 as necessary. Further, since $1 + 1/\cosh(y) \ge 1$ for all $y$, this value is at least 1. If we employ for the output node a monomial with all exponents equal to 1, it follows from the reasoning above that the output value of the network is at least 2 if and only if $s_i e_j \in S_1$. This shows that $S$ is shattered by $\mathcal{N}$.

Proof (Theorem 1). Due to the length restriction we omit the proof. The idea is to take a set $S$ constructed as in Lemma 2 and, as shown there, shattered by a network $\mathcal{N}$ with a monomial as output node and one hidden layer of summing units that use the activation function $1 + 1/\cosh$. Then $S$ is transformed into a set $S'$ and $\mathcal{N}$ into a network $\mathcal{N}'$ such that for every dichotomy $(S_0, S_1)$ induced by $\mathcal{N}$ on $S$ the network $\mathcal{N}'$ induces the corresponding dichotomy $(S_0', S_1')$ of $S'$.

From Theorem 1 we obtain the superlinear lower bound for constant-depth product unit networks. (We omit its derivation due to lack of space.)

Corollary 4. Let $n, k$ be natural numbers where $16 \le k \le 2^{n/2+2}$. There is a network of product and linear units with $n$ input units, at most $k$ hidden nodes in two layers, and at most $W = nk$ weights that has VC dimension at least $(W/32) \log(k/16)$.
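The two ingredients of the construction above — a hyperplane passing through exactly one chosen subset of points in general position, as in (1), and the activation τ(y) = 1 + 1/cosh(y) that outputs 2 exactly at weighted sum 0 — can be checked numerically. The following sketch is ours, not the paper's: random points stand in for a set in general position, and the threshold is fixed to t = 1 for simplicity.

```python
import math
import numpy as np

def tau(y):
    """Activation 1 + 1/cosh(y): equals 2 exactly at y = 0, approaches 1 as |y| grows."""
    return 1.0 + 1.0 / math.cosh(y)

rng = np.random.default_rng(0)
m = 3
points = rng.standard_normal((6, m))   # random points: almost surely in general position

# Hyperplane through the first 3 points only (threshold fixed to t = 1 here):
R = points[:3]
w = np.linalg.solve(R, np.ones(m))     # solves w . s_i = 1 for every s_i in R

# The hidden node outputs exactly 2 on the chosen subset, strictly less elsewhere:
outputs = [tau(float(p @ w - 1.0)) for p in points]
print([round(o, 6) for o in outputs[:3]])   # [2.0, 2.0, 2.0]
```

With the weights scaled by a large α, the sub-2 outputs would be pushed as close to 1 as desired, which is exactly the trick used before applying the output monomial.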
4 Linear Upper Bound for Single Product Units
We now show that the pseudo dimension, and hence the VC dimension, of a single product unit is indeed at most linear.

Theorem 5. The VC dimension and the pseudo dimension of a product unit with $n$ input variables are both equal to $n$.

Proof. That $n$ is a lower bound follows easily from the fact that the class of monomials in $n$ variables shatters the set of unit vectors from $\{0,1\}^n$. We derive the upper bound by means of the pseudo dimension of a linear unit. This is omitted here.

Corollary 6. The VC dimension and the pseudo dimension of the class of monomials with $n$ input variables are both equal to $n$.
5 Conclusions
We have established a superlinear lower bound on the VC dimension of constant-depth product unit networks. This result has implications in two directions: First, it gives theoretical evidence for the finding that product unit networks employed in practice are indeed powerful analog computing devices. Second, the VC dimension yields lower bounds for the complexity of learning in several models of learnability, e.g., on the sample size required for low generalization error. The result presented here can now be directly applied to obtain such estimates. There are, however, models of learning for which better bounds can be obtained in terms of other dimensions, such as the fat-shattering dimension and covering numbers. A topic for future research is therefore to determine good bounds for the generalization error of product unit networks in these learning models. We have also presented here a superlinear lower bound for networks with one hidden layer of a new type of summing unit and have shown that the VC and pseudo dimension of single product units are linear. This raises the interesting open problem of determining the VC dimension of product unit networks with one hidden layer. Furthermore, we do not know whether the lower bounds given here are tight. Thus far, the best known upper bounds for networks with differentiable activation functions are low-order polynomials (degree two for polynomials, four for exponentials). But they even hold for networks of unrestricted depth. The issue of obtaining tight upper bounds for product unit networks therefore seems to be closely related to a general open problem in the theory of neural networks.
References

1. M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.
2. R. Durbin and D. Rumelhart. Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1:133–142, 1989.
3. A. Ismail and A. P. Engelbrecht. Global optimization algorithms for training product unit neural networks. In International Joint Conference on Neural Networks IJCNN'2000, vol. I, pp. 132–137, IEEE Computer Society, Los Alamitos, CA, 2000.
4. P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54:190–198, 1997.
5. L. R. Leerink, C. L. Giles, B. G. Horne, and M. A. Jabri. Learning with product units. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pp. 537–544, MIT Press, Cambridge, MA, 1995.
6. W. Maass. Neural nets with superlinear VC-dimension. Neural Computation, 6:877–884, 1994.
7. W. Maass and G. Turán. Lower bound methods and separation results for on-line learning models. Machine Learning, 9:107–145, 1992.
8. A. Sakurai. Tighter bounds of the VC-dimension of three layer networks. In Proceedings of the World Congress on Neural Networks, vol. 3, pp. 540–543. Erlbaum, Hillsdale, NJ, 1993.
9. M. Schmitt. On the complexity of computing and learning with multiplicative neural networks. Neural Computation, to appear.
Generalization Performances of Perceptrons

Gérald Gavin

Laboratoire E.R.I.C., Université Lumière, 5 Av. Pierre Mendès France, 69676 Bron
[email protected]

Abstract. This paper presents new results about confidence bounds on the generalization performances of perceptrons. It deals with regression problems. It is shown that the probability of getting a generalization error greater than the empirical error plus a precision ε depends on the number of inputs and on the magnitude of the coefficients of the combination. The result presented does not require an a priori bound on the magnitude of these coefficients, nor on the size and the number of layers.
1 Introduction
Classical results from statistical learning theory give confidence bounds on the deviation between generalization and empirical errors in terms of combinatorial dimensions related to the class of functions used by the learning system [1]. For multilayer perceptrons, many studies (see for instance [2]) have shown that these dimensions grow at least as quickly as the number of parameters. These results are unsatisfactory since the number of parameters is often very large while the learning systems perform well with a relatively small training set. In [3], Bartlett has shown that for classification, "the size of the weights is more important than the size of the network". This paper aims at providing an analogous result for regression. As far as we know, it is the first explicit and general result about linear perceptrons.
2 Preliminaries

2.1 Notations and Definitions
In the following, $X$ will denote an arbitrary input space and $Y$ will refer to an output space, taken as $[-1, 1]$. Let $z_n = \{(x_1, y_1), \ldots, (x_n, y_n)\} \in (X \times Y)^n$ be a finite sample drawn i.i.d. according to an unknown distribution $D$. Given these observations, a classical goal of supervised learning is to find a functional relation $g$, in a set $G$ of functions, such that the quadratic generalization error $\mathrm{er}_D(g)$ is as low as possible. $\mathrm{er}_D(g)$ is defined as the expectation of the square of the difference between $g(x)$ and $y$:

$\mathrm{er}_D(g) = \int (g(x) - y)^2 \, dD(x, y)$

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 259–264, 2001. © Springer-Verlag Berlin Heidelberg 2001
However, as $D$ is assumed unknown, $\mathrm{er}_D(g)$ cannot be computed, only estimated. A classical way to estimate the generalization error is to consider the empirical one, defined as the quadratic error on the training sample $z_n$:

$\mathrm{er}_{z_n}(g) = \frac{1}{n} \sum_{(x_i, y_i) \in z_n} (g(x_i) - y_i)^2$

Our goal is then to bound the difference between the empirical and the generalization error of a function $g$. We are interested in bounding in probability:

$e_{G, z_n} = \sup_{g \in G} |\mathrm{er}_D(g) - \mathrm{er}_{z_n}(g)| \qquad (1)$
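As a concrete toy illustration of the two error notions, consider the following Monte Carlo sketch (ours, not the paper's: the hypothesis g, the distribution, and the sample sizes are all invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    """A fixed hypothesis g (illustrative choice)."""
    return np.clip(0.8 * x, -1.0, 1.0)

def sample(n):
    """Draw n i.i.d. pairs from a toy distribution D over X x [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.clip(x + 0.1 * rng.standard_normal(n), -1.0, 1.0)
    return x, y

# Empirical error er_{z_n}(g) on a small training sample:
x_train, y_train = sample(50)
emp_err = np.mean((g(x_train) - y_train) ** 2)

# Monte Carlo estimate of the generalization error er_D(g):
x_big, y_big = sample(200_000)
gen_err = np.mean((g(x_big) - y_big) ** 2)

print(f"empirical: {emp_err:.3f}  generalization (MC): {gen_err:.3f}")
```

The bounds of this paper control how far the two printed quantities can be from each other, uniformly over the class G.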
The smaller $e_{G, z_n}$ is, the better the generalization performances are controlled. Here we will focus on bounded linear combinations of a set $H \subseteq [-1, 1]^X$: $G_\infty = \{\sum_i w_i h_i : h_i \in H,\ \sum_i |w_i| < \infty\}$. Depending on the set $H$, $G_\infty$ will be, for example, the set of multilayer perceptrons or of linear regressions. In the following, let $G_A = \{\sum_i w_i h_i : h_i \in H \text{ s.t. } \sum_i |w_i| < A\}$.

2.2 Covering Numbers
Covering numbers appear naturally in the study of generalization properties of learning machines. They are central to many bounds concerning both classification and regression. Their definition is related to ε-covers in a pseudo-metric space $(E, \rho)$: a set $T \subseteq E$ is an ε-cover of a set $A \subset E$ with respect to $\rho$ if for all $x \in A$ there is a $y \in T$ such that $\rho(x, y) \le \varepsilon$. Given ε, the covering number $N(\varepsilon, A, \rho)$ of $A$ is then the size of the smallest ε-cover of $A$. In learning theory, the distance $\rho$ is usually defined from a sequence $X_n = (x_1, \ldots, x_n)$, $x_i \in X$. The average $l_1$ seminorm $l_1^{X_n}(f, g) = \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - g(x_i)|$ is an example of such a distance. The following theorem, adapted from Vidyasagar [8], sheds light on the role of covering numbers.

Theorem 1. Let $D$ be a distribution over $X \times [-1, 1]$ and let $G \subset [-1, 1]^X$. For $n \ge \frac{2}{\varepsilon^2}$, we have:

$P_{D^n}(e_{G, z_n} \ge \varepsilon) \le 4\, E_{D^{2n}} N\!\left(\frac{\varepsilon}{64}, G, l_1^{X_{2n}}\right) e^{\frac{-n\varepsilon^2}{128}}$
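Covering numbers under the empirical $l_1$ seminorm can be estimated by a simple greedy construction. The sketch below is our own toy illustration (the function class — scaled identity maps on $[-1,1]$ — the sample, and the discretization are invented for the example); the greedy centers are pairwise more than ε apart, and every discarded function is within ε of a center, so their count bounds the covering number of the discretized class:

```python
import numpy as np

def l1_seminorm(f_vals, g_vals):
    """Empirical l1 distance (1/n) * sum_i |f(x_i) - g(x_i)| between two
    functions given by their value vectors on the sample x_1, ..., x_n."""
    return float(np.mean(np.abs(f_vals - g_vals)))

def greedy_cover_size(value_vectors, eps):
    """Size of a greedily built eps-cover of the (discretized) class."""
    cover = []
    for v in value_vectors:
        if all(l1_seminorm(v, c) > eps for c in cover):
            cover.append(v)
    return len(cover)

rng = np.random.default_rng(2)
xs = rng.uniform(-1.0, 1.0, 100)
# Class {x -> w * x : w in [-1, 1]}, discretized, evaluated on the sample:
family = [w * xs for w in np.linspace(-1.0, 1.0, 201)]

for eps in (0.5, 0.2, 0.05):
    print(eps, greedy_cover_size(family, eps))
```

As expected for a one-parameter class, the cover size grows roughly like $1/\varepsilon$ as ε shrinks.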
Bounding the covering numbers is thus a central issue. Many studies aim at bounding their value, mostly by using combinatorial dimensions such as the pseudo-dimension or the fat-shattering dimension [7]. As far as we know, very few bounds relating the covering number of $G_A$ to the covering number of $H$ exist. Such bounds are however very interesting since they may be used for any linear combination of learning machines. By looking at the proof of Theorem 17 in the technical report related to [3], it is not difficult to derive the following bound (see section 6.2):

$N(\varepsilon, G_A, l_1^{X_{2n}}) \le \left(2 N\!\left(\frac{\varepsilon}{2A}, H, l_1^{X_{2n}}\right) + 1\right)^{\frac{2A^2}{\varepsilon^2}} \qquad (2)$
which in turn can be used in Theorem 1 to give confidence bounds. The resulting bound is however too large to be practical. This drawback is important and motivates the search for better bounds. The next section presents a new approach which leads to better results when learning specifically with quadratic loss functions.
3 New Confidence Bounds

3.1 General Confidence Bound Dealing with Linear Combinations

Let us consider the set $G_\infty$. We are then concerned with

$e_{G_\infty, z_n} = \sup_{g \in G_\infty} \frac{|\mathrm{er}_D(g) - \mathrm{er}_{z_n}(g)|}{(A_g + 1)^2}$

where $A_g = \sum_i |w_i|$. Unlike previously (1), the deviation between the empirical error and the generalization one is scaled by $1/(A_g + 1)^2$. By introducing this factor, we avoid making assumptions about $G_\infty$ and stay in a general framework. This will be discussed shortly. We now prove the main result of this paper.

Theorem 2. Let $D$ be a distribution over $X \times [-1, 1]$. If $n \ge \frac{2}{\varepsilon^2}$, we have:

$P_{D^n}(e_{G_\infty, z_n} \ge \varepsilon) \le 12\, E_{D^{2n}} N^2\!\left(\frac{\varepsilon}{32}, H, l_1^{X_{2n}}\right) e^{\frac{-n\varepsilon^2}{128}}$

The proof is detailed in section 4.

Discussion. The theorem may be rephrased as: with a certain probability, the inequality $\mathrm{er}_D(g) \le \mathrm{er}_{z_n}(g) + \varepsilon (A_g + 1)^2$ holds for all $g \in G_\infty$. This inequality concerns a wide class of functions which does not necessarily contain the target function. The value $A_g$ is not bounded a priori. Hence, the confidence concerns the whole space $G_\infty$ instead of a restricted one like $G_A$ defined in the previous section. The structure of the functions appears a posteriori in $e_{G_\infty, z_n}$. This implies that, when applying this theorem, it is not necessary to make assumptions about the value of $A_g$; a structural risk minimization is hence not required. The theorem suggests minimizing at one and the same time the empirical error and the magnitude of the output weights. This corresponds to a natural intuition: if the output weights are large, then the difference between two functions in $G_\infty$ is significant and hence the control is less precise. The approach developed in this theorem is direct and does not require bounding the covering number of the set $G_\infty$. From this point of view, it is an original approach. To compare our bounds to the one obtained by combining Bartlett's bound (2) with Theorem 1, we consider $G_A \subset G_\infty$ such that
$\forall g \in G_A$, $A_g < A$. That is, we impose an a priori bound over the set $G_\infty$. Then, according to Theorem 2, we have:

$P_{D^n}(e_{G, z_n} \ge \varepsilon) \le 12\, E_{D^{2n}} N^2\!\left(\frac{\varepsilon}{32A}, H, l_1^{X_{2n}}\right) e^{\frac{-n\varepsilon^2}{128(A+1)^2}}$

whereas Bartlett's bound combined with Theorem 1 leads to:

$P_{D^n}(e_{G, z_n} \ge \varepsilon) \le 4\, E_{D^{2n}} \left(2 N\!\left(\frac{\varepsilon}{128A}, H, l_1^{X_{2n}}\right) + 1\right)^{\frac{8192 A^2}{\varepsilon^2}} e^{\frac{-n\varepsilon^2}{128}}$

Hence, when minimizing the quadratic error, our bound gives a better confidence than what is derived by using the approach developed in [3]. The leading exponent $\frac{8192 A^2}{\varepsilon^2}$ is indeed very large for small ε compared to the exponent 2 in our bound. Our result is hence more accurate. To obtain practical bounds over the generalization error, one has to bound the covering number of the set $H$. Many existing results may be plugged in. We do not present them since they strongly depend on the set $H$; see for instance [5], [1].

3.2 One Hidden-Layer Perceptron
A one-hidden-layer perceptron can be seen as a linear combination of functions. The previous results can be directly used in order to get confidence bounds. Let us consider one-hidden-layer perceptrons defined by:

– p inputs
– one hidden layer (size I not fixed a priori; sigmoid activation function¹)
– one linear output.

We denote by $w_i$ the output weights and by $h_i$ the output of the $i$-th neuron of the hidden layer. The function $g$ computed by this perceptron can be written as a linear combination of the functions $h_i$, i.e. $g = \sum_{i=1}^{I} w_i h_i$.
We denote by $H$ the set of all possible functions $h_i$ ($H$ is the set of the functions computed by a perceptron with $p$ inputs) and by $G_\infty$ the set of linear combinations of $H$. In order to apply Theorem 2, we need to bound the covering number $N(H, \varepsilon, l_1^X)$. The Pollard dimension is an extension of the VC dimension to the regression case. The Pollard dimension of $H$ is equal to $p + 1$. The finiteness of this dimension allows one to bound the covering number. Haussler [6] bounds the covering number $N(F, \varepsilon, l_1^{X_n})$ for a set $F$. Denoting by $d$ the Pollard dimension of $F$, one gets the following result:

$N(F, \varepsilon, l_1^{X_n}) \le e(d+1) \left(\frac{2e}{\varepsilon}\right)^d \qquad (3)$

Inequality (3) and Theorem 2 lead to the following result.
¹ Similar results hold for other classical activation functions.
Theorem 3. Let $D$ be a distribution over $\mathbb{R}^p \times [-1, 1]$. If $n \ge \frac{2}{\varepsilon^2}$, we have:

$P_{D^n}(e_{G_\infty, z_n} \ge \varepsilon) \le 12\, e^2 (d+1)^2 \left(\frac{64e}{\varepsilon}\right)^{2p+2} e^{\frac{-n\varepsilon^2}{128}}$

This result can be written in terms of sample complexity $n(\varepsilon, \delta)$, i.e. the smallest training sample size $n$ needed to ensure $P_{D^n}(e_{G_\infty, z_n} \ge \varepsilon) \le \delta$. We show that

$n(\varepsilon, \delta) = O\!\left(\frac{1}{\varepsilon^2}\left(p \ln \frac{1}{\varepsilon} + \ln \frac{1}{\delta}\right)\right)$
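The sample-complexity form follows by setting the right-hand side of Theorem 3 below δ and solving for n; a sketch of this routine computation (ours, not the paper's):

```latex
12\, e^2 (d+1)^2 \Bigl(\tfrac{64e}{\varepsilon}\Bigr)^{2p+2}
   e^{-n\varepsilon^2/128} \le \delta
\;\Longleftrightarrow\;
n \ge \frac{128}{\varepsilon^2}\Bigl(
      \ln\bigl(12\, e^2 (d+1)^2\bigr)
      + (2p+2)\ln\tfrac{64e}{\varepsilon}
      + \ln\tfrac{1}{\delta}\Bigr)
% With d = p + 1 this is
% O\!\left(\tfrac{1}{\varepsilon^2}\bigl(p\ln\tfrac{1}{\varepsilon}
%   + \ln\tfrac{1}{\delta}\bigr)\right).
```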
4 Proof of Theorem 2
For the sake of simplicity, we first suppose that $y$ is a function of $x$ and that the distribution $D$ is over $X$. By definition, we have $\mathrm{er}_D(g) = \int_X \bigl(\sum_{i \in I} w_i h_i(x) - y(x)\bigr)^2 dD$, where $I$ is a finite or infinite countable index set. We develop the integral and permute sum and integral (the bounded convergence theorem can be used here since everything is bounded); we have then, for $n \ge 2/\varepsilon^2$,

$|\mathrm{er}_D(g) - \mathrm{er}_{z_n}(g)| \le \sum_{(i_1, i_2) \in I^2} |w_{i_1} w_{i_2}|\, \bigl|E_D(h_{i_1} h_{i_2}) - E_{z_n}(h_{i_1} h_{i_2})\bigr| + 2 \sum_{i \in I} |w_i|\, \bigl|E_D(y\, h_i) - E_{z_n}(y\, h_i)\bigr| + \bigl|E_D(y^2) - E_{z_n}(y^2)\bigr| \qquad (*)$

We then bound the covering numbers of the sets $G_0 = \{h_{i_1} h_{i_2} : h_{i_1}, h_{i_2} \in H\}$, $G_1 = \{y \cdot h : h \in H\}$ and $G_2 = \{y^2\}$ by the covering number of $H$ (see [4] for details):

$1 = N(\varepsilon, G_2, l_1^{X_{2n}}) \le N(\varepsilon, G_1, l_1^{X_{2n}}) \le N(\varepsilon, G_0, l_1^{X_{2n}}) \le N\!\left(\frac{\varepsilon}{2}, H, l_1^{X_{2n}}\right)^2$

According to Theorem 1, we have then, for each set $G_k$, $k = 0, 1, 2$,

$P_{D^n}\Bigl(\sup_{g \in G_k} |E_D(g) - E_{z_n}(g)| \ge \varepsilon\Bigr) \le 4\, E_{D^{2n}} N\!\left(\frac{\varepsilon}{32}, H, l_1^{X_{2n}}\right)^2 e^{\frac{-n\varepsilon^2}{128}}$

We thus choose a sample $z_n$ such that

$\bigwedge_{k=0}^{2} \; \sup_{h \in G_k} |E_D(h) - E_{z_n}(h)| \le \varepsilon .$

The probability of obtaining such a sample is at least $1 - 12\, E_{D^{2n}} N\!\left(\frac{\varepsilon}{32}, H, l_1^{X_{2n}}\right)^2 e^{\frac{-n\varepsilon^2}{128}}$. For such a sample $z_n$, the inequality $(*)$ becomes

$|\mathrm{er}_D(g) - \mathrm{er}_{z_n}(g)| \le \sum_{(i_1, i_2) \in I^2} |w_{i_1} w_{i_2}|\, \varepsilon + 2 \sum_{i \in I} |w_i|\, \varepsilon + \varepsilon$

and by refactoring, we have for all $g \in G_\infty$

$|\mathrm{er}_D(g) - \mathrm{er}_{z_n}(g)| \le \varepsilon \Bigl(\sum_{i \in I} |w_i| + 1\Bigr)^2 .$
We can now extend our result to distributions over $X \times Y = X \times [-1, 1]$. Indeed, for any set $F$ of functions $f \in [-1, 1]^X$ we consider the set $\tilde{F}$ of functions $\tilde{f} \in [-1, 1]^{X \times Y}$ such that $\tilde{f}(x, y) = f(x)$, and the target function $f_t \in [-1, 1]^{X \times Y}$ such that $f_t(x, y) = y$. As the covering numbers of $\tilde{F}$ are equal to those of $F$ and $\mathrm{er}_D(f) = E_D\bigl(\tilde{f}(x, y) - f_t(x, y)\bigr)^2$, we can apply the above result. This finishes the proof.
5 Conclusion
Quadratic loss functions are widely used in neural networks to solve regression problems. From Bartlett's results [3], one can derive a confidence bound independent of the size of the linear combination, corresponding, in the case of neural networks, to the number of hidden units of the hidden layer. Our approach, because it is devoted to quadratic loss functions, gives much better bounds in a simpler way, without using complex combinatorial notions such as the fat-shattering dimension. As far as we know, our result is the first, in regression, concerning $G_\infty$ that gives a practical confidence bound on the deviation between the generalization and the empirical error. Unlike Bartlett, our result does not limit, a priori, the absolute sum of the weights of the linear combination. One can use it for regularization by reducing the structural risk $\mathrm{er}_{z_n}(g) + \varepsilon (A_g + 1)^2$. We obtain the same result for the multilayer perceptron case by considering an extended computation of $A_g$.
References
1. N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. In Proceedings of the ACM Symposium on Foundations of Computer Science, 1993.
2. M. Anthony. Probabilistic analysis of learning in artificial neural networks: the PAC model and its variants. Technical report, The London School of Economics and Political Science, 1995.
3. P. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. Technical report, Department of Systems Engineering, Australian National University, 1996.
4. G. Gavin and A. Elisseeff. Confidence bounds for the generalization performances of linear combinations of functions. Technical report, ERIC, Université Lumière, Bron, 1999. http://eric.univ-lyon2.fr/eric.
5. L. Gurvits and P. Koiran. Approximation and learning convex superpositions. In Computational Learning Theory: EuroCOLT'95, 1995.
6. D. Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, 69:217–232, 1995.
7. M. Kearns and R. Schapire. Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 31st Symposium on the Foundations of Computer Science, pages 382–391. IEEE Computer Society Press, Los Alamitos, CA, 1990.
8. M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.
Bounds on the Generalization Ability of Bayesian Inference and Gibbs Algorithms

Olivier Teytaud and Hélène Paugam-Moisy

Institut des Sciences Cognitives, UMR CNRS 5015, 67 boulevard Pinel, F-69675 Bron cedex, France
teytaud, [email protected]

Abstract. Recent theoretical works applying the methods of statistical learning theory have put into relief the interest of old, well-known learning paradigms such as Bayesian inference and Gibbs algorithms. Sample complexity bounds have been given for such paradigms in the zero-error case. This paper studies the behavior of these algorithms without this assumption. Results include uniform convergence of the Gibbs algorithm towards Bayesian inference, the rate of convergence of the empirical loss towards the generalization loss, and convergence of the generalization error towards the optimal loss in the underlying class of functions.
1 Introduction
Recent works about the Bayes Point Machine ([5,6]), reporting good results on a few benchmarks, have put into relief the interest of old, well-known learning paradigms such as Bayesian inference. Haussler, Kearns and Schapire [4] give results upon Bayesian learning in the zero-error case, and Devroye, Györfi and Lugosi [3] recall negative results upon similar paradigms. We consider learning algorithms working on independent identically distributed samples D = {(X₁, Y₁), ..., (X_m, Y_m)}, with X_i ∈ X and Y_i ∈ {0, 1}. We consider an a priori law Pr on the space W of hypotheses (this does not mean, in any of the following results, that the underlying dependency is drawn according to this distribution!). We define the following loss function, for w ∈ W: L(Y, X, w) = ln((1 − Y)w(X) + Y(1 − w(X))). The empirical loss L_E(w) is the average of L(Y_i, X_i, w) on the learning sample. L(w) (the loss in generalization) is the expectation of L(Y, X, w). This means that w is not simply considered as a classifier, but that w(X) = 2/3 when Y = 1 is worse than w(X) = 9/10, even if thresholding would lead to 1 in both cases. Notice that a good behavior of this loss function implies a good behavior of the loss function of the associated classifier, which chooses class 1 if and only if f_{D_m}(x) > 1/2. The converse is not necessarily true, and an important point is that Bayesian inference can achieve a better error rate than any classifier of the underlying family W of functions. For clarity, we denote R(Y, X, w) = ln(1 − exp L(Y, X, w)) and R(w) = E(R(Y, X, w)) the expectation of R(·, ·, w) for (X, Y) samples. We call Bayesian inference the algorithm which proposes as classifier f_{D_m}(x) = ∫_W V(w, D_m) w(x) dPr(w), with

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 265–271, 2001.
© Springer-Verlag Berlin Heidelberg 2001
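Under the definitions above, the loss rewards w(X) for being close to the observed label: for Y = 1, L = ln(1 − w(X)) and exp R = w(X) is the probability w assigns to the correct class. A small sketch of the 2/3 versus 9/10 example from the text (the specific values are only illustrative):

```python
import math

def loss_L(y, wx):
    """L(Y, X, w) = ln((1 - Y) w(X) + Y (1 - w(X))), as defined in the text."""
    return math.log((1 - y) * wx + y * (1 - wx))

def loss_R(y, wx):
    """R(Y, X, w) = ln(1 - exp L(Y, X, w)); exp R is the likelihood that w
    assigns to the observed label."""
    return math.log(1 - math.exp(loss_L(y, wx)))

# Both predictions threshold to class 1, but w(X) = 2/3 incurs a larger loss
# than w(X) = 9/10 when Y = 1.
l_coarse = loss_L(1, 2 / 3)    # ln(1/3): worse
l_sharp = loss_L(1, 9 / 10)    # ln(1/10): better (more negative)
```

Note that a larger value of L here means a worse fit, which is why thresholding alone loses information that the Bayesian average exploits.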
V(w, D_m) = Π_{i=1}^m exp R(Y_i, X_i, w) / ∫_W Π_{i=1}^m exp R(Y_i, X_i, w) dPr(w),

the normalized likelihood of w. Note that this function does not necessarily lie in W. We call first Gibbs algorithm ([8]) of order k the algorithm which proposes as classifier g^k_{D_m}(x) = (1/k) Σ_{i=1}^k w_i(x), the k classifiers w_i being drawn randomly according to V(w, D_m)Pr(w). We call second Gibbs algorithm of order k the algorithm which proposes as classifier g̃^k_{D_m}(x) = Σ_{i=1}^k w_i V(w_i, D_m) / Σ_{i=1}^k V(w_i, D_m), the w_i being drawn randomly according to Pr(w). As Gibbs algorithms are non-deterministic, this inference can be used in two ways: the first one consisting in choosing the classification function once, the second one consisting in doing new draws of the w_i's at each example to be classified. Next we consider one random choice of the w_i's, during the learning phase. The expectation of the loss of this algorithm is the probability of error of the non-deterministic equivalent algorithm before stochastic choice of the w_i's. N(ε, F, L_p) denotes the covering number of a space F for distance L_p. The following notions are used: Donsker classes, central limit theorems for Donsker classes, VC-dimension (the VC-dimension of a class of real-valued functions is the VC-dimension of their subgraphs), fat-shattering dimension, entropy integral, covering numbers, weak convergence, weak O(.). An introduction to these notions can be found in [12]. We recall that a class F of functions is Donsker for a distribution of examples X_i if (1/√m) Σ_{i=1}^m X_i converges weakly with m to a tight limit in l^∞(F). This implies uniform weak convergence of empirical means to the expectations, in O(1/√m). The main difference with VC-theory is the non-uniform (in the distribution) and asymptotic nature of this convergence. For φ Lipschitzian, we recall that {φ ∘ f : f ∈ F} is Donsker provided that F is Donsker.
This is useful for proving that the family of loss functions (in particular our loss function, which is Lipschitzian provided that w and 1 − w are lower bounded by c > 0) is Donsker provided that the family of functions used for prediction is Donsker. A few easy results can be stated. The covering numbers for L1 of the class of functions in which g^k_{D_m} is chosen are bounded by N(ε, W, L1)^k. If the VC-dimension V of W is finite, then the VC-dimension of this class of functions is bounded by k×V. Lipschitz coefficients in Bayesian inference or Gibbs algorithms are preserved, which can lead to bounds in terms of the fat-shattering dimension¹. Bayesian averaging, in the Bayes Point Machine for example, consists in using as classifier the expectation of the w ∈ W such that all the data points are classified correctly by w; this is not equivalent to Bayesian inference, and some examples [10] can be given in which Bayesian inference is significantly asymptotically better than Bayesian averaging. Results of [9] explain how to compute Bayesian averaging. Part 2 gives bounds on the difference between the empirical loss and the loss in generalization. Part 3 gives bounds on the rate of convergence of Bayesian inference towards the optimal error rate in W. Part 4 gives a bound on the convergence of the Gibbs algorithm towards Bayesian inference. The conclusion opens a question about the second version of the Gibbs algorithm.
¹ For examples of use of the fat-shattering dimension, the reader is referred to [2] or [1].
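The three inference rules above (Bayesian inference and the two Gibbs algorithms) can be sketched as Monte Carlo estimators over a finite hypothesis sample. The hypothesis class, prior, and data below are illustrative assumptions, not part of the paper; the key lines are the likelihood weighting V(w, D_m) and the three averages.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 0.1  # keeps w and 1 - w bounded away from 0, as assumed in the text

def predict(thresholds, x):
    """w_t(x): soft threshold classifiers, an illustrative hypothesis class W."""
    return C + (1 - 2 * C) * (np.asarray(x)[None, :] > thresholds[:, None])

def likelihood(thresholds, X, Y):
    """prod_i exp R(Y_i, X_i, w): probability each w assigns to the labels."""
    P1 = predict(thresholds, X)                  # P(Y = 1 | x) under each w
    return np.prod(np.where(Y == 1, P1, 1 - P1), axis=1)

# Illustrative data: true threshold 0.5 with 10% label noise.
X = rng.uniform(size=60)
Y = ((X > 0.5) ^ (rng.uniform(size=60) < 0.1)).astype(int)

prior = rng.uniform(size=500)                    # draws from the prior Pr(w)
V = likelihood(prior, X, Y)
V = V / V.sum()                                  # normalized likelihood V(w, D_m)

x_test = np.linspace(0, 1, 101)
f_bayes = V @ predict(prior, x_test)             # Bayesian inference f_{D_m}

# First Gibbs algorithm: average k classifiers drawn according to V(w,D_m)Pr(w).
k = 200
idx = rng.choice(len(prior), size=k, p=V)
g_first = predict(prior[idx], x_test).mean(axis=0)

# Second Gibbs algorithm: likelihood-weighted average of k draws from Pr(w).
idx2 = rng.choice(len(prior), size=k)
w2 = likelihood(prior[idx2], X, Y)
g_second = w2 @ predict(prior[idx2], x_test) / w2.sum()
```

Both Gibbs variants approximate f_{D_m}; the second is cheaper in that it only requires draws from the prior, which is exactly the tractability advantage discussed in the conclusion.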
2 Bayesian Inference: Generalization Error
Assume that W is included in a space in which integrals of functions W → R or W → W can be computed as Riemann sums. In particular, there exist w_{k,i}, i ≤ k, k ∈ N, such that

f_{D_m}(x) = lim_{k→∞} [ Σ_{i=1}^k θ_{k,i} Π_{j=1}^m exp R(Y_j, X_j, w_{k,i}) w_{k,i} ] / [ Σ_{i=1}^k θ_{k,i} Π_{j=1}^m exp R(Y_j, X_j, w_{k,i}) ] = lim_{k→∞} Σ_{i=1}^k α_{k,i,D_m} w_{k,i}

with α_{k,i,D_m} = θ_{k,i} Π_{j=1}^m exp R(Y_j, X_j, w_{k,i}) / Σ_{l=1}^k θ_{k,l} Π_{j=1}^m exp R(Y_j, X_j, w_{k,l}), which leads to Σ_{i=1}^k α_{k,i,D_m} = 1.
This assumption implies that any f_{D_m} lies in the closure of the convex hull of W (denoted conv W) for the pointwise convergence. The following results are derived from this statement.

2.1 Case 1: W Has Finite VC-Dimension
If the VC-dimension of W is finite, then the logarithms of the covering numbers of conv W are bounded by O(1/ε^p) with p < 2. This condition is verified in the case of a finite VC-dimension V + 1, but a non-distribution-free bound on covering numbers is sufficient.

Theorem 1 (see for example [12, p. 142]). If W verifies N(ε, W, L2) = O((1/ε)^V), then log N(ε, conv W, L2) = O((1/ε)^{2V/(V+2)}).

Corollary 1. If W has finite VC-dimension V, then the logarithm of the covering numbers of conv W for L2 is O(ε^{−(2V+2)/(V+3)}).

A remarkable fact is that this exponent is lower than 2. This implies boundedness of the entropy integral, and so the Donsker nature of conv W. Moreover, results upon covering numbers (in [11]) give the non-asymptotic bound of Theorem 2 (which is not distribution-free), provided that w and 1 − w are both lower bounded by c (this condition can be removed if we consider exp L instead of L; indeed, any Lipschitzian cost function can be used, and some others as well). This result is derived from [11, p. 188], where the author considers empirical risk minimization. Notice that the result can be used thanks to the Lipschitzian nature of the loss function. Covering numbers for L1(µ) need prior knowledge about the density of examples. For example, the covering numbers for L1(µ), for a density µ bounded by K, are bounded by the (ε/K)-covering numbers. Notice that no prior knowledge on the distribution of Y is required; the only assumption is on the law of the marginal distribution of X. This leads to Corollary 2. The dependency in ε is lower than 1/ε⁴. This is uniform on distributions of examples verifying the hypothesis on the density. The following result, which is not uniform, gives a better non-uniform bound whenever W has infinite VC-dimension, provided that W is Donsker.
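The exponent in Corollary 1 follows from Theorem 1 applied with the covering-number exponent V + 1 mentioned above:

```latex
% With N(\varepsilon, W, L_2) = O\big((1/\varepsilon)^{V+1}\big) for a class of
% VC-dimension V, Theorem 1 applied with exponent V + 1 gives
\log N(\varepsilon, \overline{\operatorname{conv}} W, L_2)
  = O\Big( (1/\varepsilon)^{\frac{2(V+1)}{(V+1)+2}} \Big)
  = O\Big( \varepsilon^{-\frac{2V+2}{V+3}} \Big).
% Since (2V+2)/(V+3) < 2 for every finite V, the entropy integral
% \int_0^1 \sqrt{\log N(\varepsilon)}\, d\varepsilon converges, which gives the
% Donsker property of conv W.
```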
Theorem 2 (Non-asymptotic bounds resulting from covering numbers). Let the x_i's be drawn according to density µ. Let N be the (ε×c)-covering number for L1(µ) of conv W. Then with probability ≥ 1 − δ, L(f_{D_m}) − L_E(f_{D_m}) ≤ ε if m ≥ ln(2N/δ)/(2c²ε²), with f_{D_m} chosen among an ε-cover (this implies a discretization).

Corollary 2. If W has a finite VC-dimension V, there exists a universal constant M such that if the density of examples is bounded by K, then L(f_{D_m}) − L_E(f_{D_m}) ≤ ε with probability ≥ 1 − δ if m ≥ (K/ε²) × (M(K/ε)^{(2V+2)/(V+3)} − 8 log δ).

2.2 Case 2: W Is Donsker
Theorem 3 (see for example [12, p. 190]). If W is Donsker, then the convex hull of W is Donsker and the pointwise closure of W is Donsker.

Corollary 3. conv W is Donsker whenever W is Donsker for the underlying distribution: the convex hull of W is Donsker since W is Donsker, and its pointwise closure conv W is Donsker in turn. This implies weak convergence of the empirical loss to the generalization loss in O(1/√m). This is stated in the following corollary.

Corollary 4. If W is Donsker and if the integrability hypothesis stated above holds, then weak convergence of the difference between empirical loss and generalization loss to 0 in O(1/√m) holds for Bayesian inference.
3 Bayesian Inference: Convergence Rate
Next we assume that Pr(GC(ε)) ≥ Kε^d, for ε small enough², with GC(ε) the set of classifiers w such that R(w) ≥ R* − ε, where R* = sup_{w∈W} R(w). Moreover, we assume that ∀(w, x) ∈ W × X, 0 < c ≤ w(x) ≤ 1 − c < 1. The empirical mean (1/m) Σ_{i=1}^m R(Y_i, X_i, w) converges uniformly towards R(w), at rate O(K/√m), provided that W is Donsker (e.g., W has finite VC-dimension). The expression inf_{w∈GC(ε)} Π_{i=1}^m exp R(Y_i, X_i, w) is then lower bounded by exp(m(R* − ε)) × exp(−K√m). This leads to equation 1. On the other hand, with ¬GC(ε) the complementary set of GC(ε) in W, equation 2 holds. This implies equation 3, which is the quotient between the "weight", in the integral defining f_{D_m}, of "bad classifiers" (in ¬GC(ε)) and of "good classifiers" (in GC(ε)), and expresses the fact that "good classifiers" have a much better likelihood. A(ε, m) is bounded by ε if exp(−ε/2)^m exp(2K√m) ≤ Kε^{d+1}/2^d. The ln of this, divided by m, leads
² This condition can be relaxed to Pr(GC(ε)) = exp(−o(1/ε)), which does not reduce the asymptotic rate of the convergence O(1/√m).
to equation 4, hence theorem 4.

∫_{GC(ε/2)} Π_{i=1}^m exp R(Y_i, X_i, w) dPr(w) ≥ K (exp(R* − ε/2))^m exp(−K√m) (ε/2)^d   (1)

∫_{¬GC(ε)} Π_{i=1}^m exp R(Y_i, X_i, w) dPr(w) ≤ (exp(R* − ε))^m exp(K√m)   (2)

A(ε, m) = ∫_{¬GC(ε)} V(w) dPr(w) / ∫_{GC(ε)} V(w) dPr(w) ≤ exp(−ε/2)^m exp(2K√m) / (K(ε/2)^d)   (3)

ε/2 ≥ (d + 1) ln(1/ε)/m + ln(2^d/K)/m + 2K/√m   (4)
Theorem 4 (Rate of convergence). The generalization loss converges to the optimal loss in W at rate O(1/√m), under the assumptions above. The result is asymptotic if W is Donsker. The result is non-asymptotic and distribution-free if W has finite VC-dimension.
4 Approximation by Gibbs Algorithms
One can consider Gibbs algorithms as approximations of Bayesian inference through Monte Carlo integration. We study the efficiency of this approximation. Consider the first version of the Gibbs algorithm:

g^k_{D_m}(x) = (1/k) Σ_{i=1}^k w_i(x)
The convergence rate is in O(1/√k) if the set W⊥ = {w ↦ w(x) : x ∈ X} is Donsker. The convergence is uniform in the distribution of the w_i provided that W⊥ has finite VC-dimension. This means that W has finite VC-codimension, which is true in particular if W has finite VC-dimension³. This means that under this assumption of finite VC-dimension, choosing k = Θ(m) leads to a convergence as fast as the convergence of the loss function.

Theorem 5 (Dependency in k of the Gibbs algorithm). The convergence of g^k_{D_m} towards f_{D_m} is distribution-free at rate O(1/√k) whenever W has finite VC-codimension. This is true even if conv W has infinite VC-dimension.

A similar conclusion has not been proved in the case of the second version of the Gibbs algorithm. If such a result holds, the interest would be a formal justification of an algorithm with the following advantages:
– Easily implementable.
– Computational complexity: the results above suggest k = Θ(m), leading to Θ(m²) evaluations of w(x_i), ensuring precision O(1/√m) on the loss.
– Asymptotic convergence towards the optimal loss in O(1/√m) (which is not the case in most algorithms, in which the theoretical bound to be minimized is modified for the sake of computational cost, as in Support Vector Machines).
³ We recall that the VC-codimension is bounded by 2^V, with V the VC-dimension.
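The O(1/√k) dependency in the number of drawn classifiers is just the Monte Carlo rate: the average of k bounded i.i.d. draws deviates from its expectation at rate O(1/√k). A generic numerical check (the uniform draws stand in for the values w_i(x) of randomly drawn classifiers at a fixed point; they are not tied to any specific hypothesis class):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_deviation(k, trials=400):
    """Average |g_k - E g_k| for g_k the mean of k i.i.d. draws in [0, 1]."""
    draws = rng.uniform(size=(trials, k))   # stand-ins for w_i(x) values
    return np.mean(np.abs(draws.mean(axis=1) - 0.5))

dev_100 = mean_abs_deviation(100)        # roughly 0.8 / sqrt(12 * 100)
dev_10000 = mean_abs_deviation(10000)    # about ten times smaller
```

Multiplying k by 100 divides the deviation by about 10, which is the √k law behind choosing k = Θ(m).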
5 Conclusion and Further Issues
This paper states positive general results for Bayesian inference: control of the difference between the generalization loss and the empirical loss, good asymptotic properties (convergence to the optimal loss), good non-asymptotic distribution-free properties when the VC-dimension is finite, and efficient approximation through Gibbs algorithms, with sufficient conditions on k such that the Gibbs algorithm approximates Bayesian inference well. More generally, the structure of conv W, the family of functions used by Bayesian inference, is expressed as a Donsker class (which holds independently of the i.i.d. assumption) whenever W is Donsker, and has bounded entropy in the case of finite VC-dimension. Possible improvements could include non-asymptotic bounds on classes of distributions, in the spirit of [7] or some results of [12]. The most interesting issue would have been a similar result in the case of the second Gibbs algorithm, which is still an open problem. This is strongly challenging (and probably false, as put into relief by an anonymous referee), since the second algorithm presents the main advantage of being much more tractable, as it only requires random draws of the w_i's according to Pr. Such a positive result would lead to an algorithm actually minimizing a loss function, whereas usual learning algorithms, such as Support Vector Machines or Bayes Point Machines, contain approximations which lead to undesired behaviors⁴.
References
1. N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler. Scale-sensitive dimensions, uniform convergence and learnability. Journal of the ACM, 44(4):615–631, 1997.
2. P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44:525–536, 1998.
3. L. Devroye, L. Györfi, G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1997.
4. D. Haussler, M. Kearns, R. E. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14(1):83–113, 1994.
5. R. Herbrich, T. Graepel, C. Campbell. Bayes Point Machines: estimating the Bayes point in kernel space. In Proceedings of the IJCAI Workshop on Support Vector Machines, pages 23–27, 1999.
6. R. Herbrich, T. Graepel, C. Campbell. Robust Bayes Point Machines. In Proceedings of ESANN 2000, pp. 49–54, 2000.
7. B. K. Natarajan. Learning over classes of distributions. In Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 408–409, San Mateo, CA, Morgan Kaufmann, 1988.
⁴ Support Vector Machines are biased by the distance of points to the separating hyperplane, as the number of errors is replaced, for the sake of computational time, by a sum of distances; and Bayes Point Machines, for similar reasons, work on Bayesian averaging instead of Bayesian inference: f_{D_m} is chosen as the centre of the version space through an algorithm developed in [9].
8. M. Opper, D. Haussler. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop, pp. 75–87. Morgan Kaufmann, 1991.
9. P. Ruján. Playing billiard in version space. Neural Computation, 1997.
10. O. Teytaud. Bayesian learning/structural risk minimization. Research Report RR-2005, ERIC, http://eric.univ-lyon2.fr, 2000.
11. M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.
12. A. W. van der Vaart, J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
Learning Curves for Gaussian Processes Models: Fluctuations and Universality

Dörthe Malzahn and Manfred Opper

Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, United Kingdom
{malzahnd,opperm}@aston.ac.uk

Abstract. Based on a statistical mechanics approach, we develop a method for approximately computing average-case learning curves and their sample fluctuations for Gaussian process regression models. We give examples for the Wiener process and show that universal relations (that are independent of the input distribution) between error measures can be derived.
1 Introduction
Gaussian process (GP) models have gained considerable interest in the neural computation community (see e.g. [1,2,3,4]) in recent years. However, being non-parametric models by construction, their theoretical understanding is less well developed compared to simpler parametric models like neural networks. In this paper we present new results for the approximate computation of learning curves, further developing our framework from [5], which was based on a statistical mechanics approach. In contrast to most previous applications of statistical mechanics to learning theory, the method is not restricted to the so-called "thermodynamic" limit, which would require a high-dimensional input space. Our approach has the advantage that it is rather general: it may be applied to different likelihoods and allows for a systematic computation of corrections. In this contribution we rederive our approximation in a new way based on a general variational method. We show that we can compute other interesting quantities like the sample fluctuations of the generalization error. Nevertheless, one may criticise this and similar approaches of statistical physics as being not relevant for practical situations, because the analysis requires knowledge of the input distribution, which is usually not available. However, we will show (so far for a toy example) that our approximation predicts universal relations (that are independent of the input distribution) between different error measures. We expect that similar relations may be obtained for more practical situations.
2 Regression with Gaussian Processes
Regression with Gaussian processes is based on a statistical model [2] where observations y(x) ∈ R at input points x ∈ R^D are assumed to be corrupted values

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 271–276, 2001.
© Springer-Verlag Berlin Heidelberg 2001
of an unknown function f(x). For independent Gaussian noise with variance σ², the likelihood for a set of m example data D = (y(x₁), ..., y(x_m)) (conditioned on the function f) is given by

P(D|f) = exp( −Σ_{i=1}^m (y_i − f(x_i))²/(2σ²) ) / (2πσ²)^{m/2}   (1)

where y_i = y(x_i). To estimate the function f(x), one supplies the a priori information that f is a realization of a Gaussian process (random field) with zero mean and covariance C(x, x') = E[f(x)f(x')], where E denotes the expectation over the Gaussian process prior. Predictions f̂(x) for the unknown function f are computed as the posterior expectation of f(x), i.e.

f̂(x|D) = E{f(x)|D} = E[f(x)P(D|f)] / Z_m   (2)

where the partition function Z_m normalises the posterior. In the sequel, we call the true data-generating function f* in order to distinguish it from the functions over which we integrate in the expectations. We will compute approximations for the learning curve, i.e. the generalization (mean square) error averaged over independent draws of example data, ε_g = [(f*(x) − f̂(x|D))²]_{(x,D)}, as a function of m, the sample size. We will use brackets [...] to denote averages over data sets, where we assume that the inputs x_i are drawn independently at random from a density p(x). The index at the bracket denotes the quantities that are averaged over. For example, [...]_{(x,D)} denotes both an average over example data D and a test input drawn from the same density. We will also approximate the sample fluctuations of the generalization error, defined by ∆ε_g = [ [(f*(x) − f̂(x|D))²]_x² ]_D − ε_g².
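For the Gaussian likelihood (1), the posterior expectation (2) has the standard closed form f̂(x|D) = k(x)ᵀ(K + σ²I)⁻¹y, with K_ij = C(x_i, x_j) and k(x)_i = C(x, x_i). A minimal sketch using the Wiener covariance C(x, x') = min(x, x') that serves as the toy model later in the paper (the target function and sample sizes are illustrative choices):

```python
import numpy as np

def wiener_cov(a, b):
    """C(x, x') = min(x, x'): covariance of the Wiener process."""
    return np.minimum(a[:, None], b[None, :])

def gp_posterior_mean(x_train, y, x_test, sigma2):
    """Posterior expectation (2) in closed form: k(x)^T (K + sigma^2 I)^{-1} y."""
    K = wiener_cov(x_train, x_train)
    k = wiener_cov(x_test, x_train)
    return k @ np.linalg.solve(K + sigma2 * np.eye(len(x_train)), y)

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0.05, 1.0, size=100))
f_true = np.sin(3 * x_train)                   # illustrative target function
y = f_true + 0.1 * rng.standard_normal(100)    # noise variance sigma^2 = 0.01

f_hat = gp_posterior_mean(x_train, y, x_train, sigma2=0.01)
```

The posterior mean smooths the noisy observations toward the underlying function, which is exactly the quantity whose averaged squared error defines the learning curve ε_g.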
3 The Partition Function
As typical of statistical mechanics approaches, we base our analysis on the averaged "free energy" [−ln Z_m]_D, where the partition function Z_m (see Eq. (2)) is Z_m = E P(D|f). [ln Z_m]_D serves as a generating function for suitable posterior averages. The computation of [ln Z_m]_D is based on the replica trick [ln Z_m]_D = lim_{n→0} ∂ln[Z_m^n]_D/∂n, where we compute [Z_m^n]_D for integer n and perform the continuation at the end. We have

Z_n(m) = [Z_m^n]_D = E_n [ exp( −Σ_{a=1}^n (f_a(x) − y)²/(2σ²) ) / (2πσ²)^{n/2} ]_{(x,y)}^m ,   (3)

where E_n denotes the expectation over the GP measure for the n-times replicated GPs f_a(x), a = 1, ..., n.
For further analytical treatment, it is convenient to introduce the "grand canonical" free energy

Ξ_n(µ) = Σ_{m=0}^∞ (e^{µm}/m!) Z_n(m) = E_n exp[−H_n]   (4)

where the energy H_n is a functional of {f_a}:

H_n = −e^µ [ exp( −Σ_{a=1}^n (f_a(x) − y)²/(2σ²) ) / (2πσ²)^{n/2} ]_{(x,y)} .   (5)
This represents a "poissonized" version of our model where the number of examples is fluctuating. For sufficiently large m, the relative fluctuations are small and both models will give the same answer, provided the "chemical potential" µ and the desired m are related by m = ∂ln Ξ_n(µ)/∂µ. Using a Laplace argument for the sum in Eq. (4), we have ln Z_n(m) ≈ ln Ξ_n(µ) + m(ln m − 1) − mµ. Note that as a result of the data average, the model defined by H_n is no longer Gaussian and we cannot compute ln Ξ_n(µ) exactly. We will therefore resort to a variational approximation.
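The Laplace argument amounts to keeping the dominant term of the sum (4) and applying Stirling's formula:

```latex
% Keeping the dominant term m of \Xi_n(\mu) = \sum_m e^{\mu m} Z_n(m)/m! gives
\ln \Xi_n(\mu) \;\approx\; \mu m - \ln m! + \ln Z_n(m)
            \;\approx\; \mu m - m(\ln m - 1) + \ln Z_n(m),
% using Stirling's formula \ln m! \approx m \ln m - m.  Solving for \ln Z_n(m)
% recovers \ln Z_n(m) \approx \ln \Xi_n(\mu) + m(\ln m - 1) - m\mu.
```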
4 Variational Approximation
Our goal is to approximate H_n by a simpler quadratic Hamiltonian of the form

H_n⁰ = (1/2) Σ_{a,b=1}^n η_{ab} [(f_a(x) − y)(f_b(x) − y)]_{(x,y)} ,

where the η_{ab} are parameters to be optimised. Assuming η_{ab} to be fixed for the moment, we can expand the free energy in a power series in the deviations H_n − H_n⁰
−ln Ξ_n(µ) = −ln E_n exp[−H_n⁰] + ⟨H_n − H_n⁰⟩₀ − (1/2)( ⟨(H_n − H_n⁰)²⟩₀ − ⟨H_n − H_n⁰⟩₀² ) ± ... .   (6)

The brackets ⟨...⟩₀ denote averages with respect to the effective Gaussian measure induced by the replicated prior and e^{−H_n⁰}. As is well known [6], the first two terms in Eq. (6) are an upper bound to −ln Ξ_n(µ). We will optimise H_n⁰ by choosing the matrix η_{ab} such that this upper bound is minimised. Thereafter, a replica symmetric continuation to real n is achieved by restricting the variations to the form η_{ab} = η for a ≠ b and η_{aa} = η₀. Note, however, that after this continuation we can no longer establish a bound on −ln Ξ_n(µ). To compute the generalization error and other quantities we will use the effective Gaussian measure induced by H_n⁰. The variational equations on η₀ and η can be expressed as functionals of the local generalization error

ε_g(x) = lim_{n→0} ⟨(f₁(x) − f*(x))(f₂(x) − f*(x))⟩₀   (7)

and the local posterior variance

v_p(x) = lim_{n→0} ⟨(f₁(x) − f*(x))²⟩₀ − ε_g(x).   (8)
Fig. 1. Learning curves (left) and their fluctuations (right) for the Wiener process: comparison between theory (lines) and simulation results (symbols). [Figure: log-scale plots of the error measures ε_g, ε_t (left) and their fluctuations ∆ε_g, ∆ε_t (right) versus the number m of example data; inset: ∆ε_t for small m.]
By neglecting variations of these quantities with x we arrive at the following set of equations

[Ĉ(x, x)]_x + σ² = m/(η₀ − η)   (9)

[Ẽ²(f(x) − y)]_{(x,y)} − η[Ĉ²(x, x')]_{(x,x')} = −mη/(η₀ − η)²   (10)

that determine the values of the variational parameters η₀ and η. The mean Ẽ(f(x) − y) and the covariance Ĉ(x, x') = Ê(f(x)f(x')) in Eqs. (9, 10) are defined with respect to the Gaussian measures Ẽ ∝ E exp( −((η₀ − η)/2) [(f(x) − y)²]_{(x,y)} ) and Ê ∝ E exp( −((η₀ − η)/2) [f²(x)]_x ).
5 Results for Learning Curves and Fluctuations
We compare our analytical results for the generalization error ε_g, the training error ε_t = (1/m)[ Σ_{i=1}^m (f̂(x_i|D) − y(x_i))² ]_D, and for their sample fluctuations ∆ε_g, ∆ε_t with simulations of GP regression. For simplicity, we have chosen the Wiener process C(x, x') = min(x, x'), with x, x' ∈ [0, 1], as a toy model. For Fig. 1, the target function f* is a fixed but random realisation from the prior distribution and the data noise is Gaussian with variance σ² = 0.01. The left panel of Fig. 1 shows learning curves, while their fluctuations are displayed in the right panel. Symbols represent simulation results and our theory is given by lines. The training error ε_t converges to the noise level σ². As one can see from the figures, our theory is very accurate when m is sufficiently large. It also predicts the initial increase of ∆ε_t for small values of m (see inset of Fig. 1, right panel).
Fig. 2. Bayes errors for the Wiener process: theory (bold line) versus simulations (symbols). Simulations have been performed for two input distributions on x ∈ [0, 1]: p(x) = 1 (squares) and p(x) = 2x (triangles). Arrows indicate the number m of example data (m = 20, 50, 800). As m increases, the Bayesian generalization error ε_{g,B} and its error bars decrease. For m → ∞ the trivial limit ε_{g,B} ≈ ε_{t,B} holds (dashed line). [Figure: ε_{g,B}/σ² plotted versus ε_{t,B}/σ².]
6 Universal Relations
Although the explicit computation of our results requires the knowledge of the data distribution, we can establish universal relations (valid in the framework of our approximation) which are independent of this density. We restrict ourselves to the full Bayesian scenario where all quantities are averaged over the prior distribution of true functions f*. The uncertainty of the prediction at a point x is measured by the posterior variance ε_B(x) = E(f̂(x|D) − f(x))². Approximations to Bayesian generalization errors, defined as ε_{g,B} = [ε_B(x)]_{(x,D)}, for this scenario were computed previously by Peter Sollich [7]. For the special case of a uniform input distribution, our results turn out to be identical to Sollich's result. However, extending our framework to arbitrary input densities, we find that the Bayesian generalization error and its empirical estimate ε_{t,B} = (1/m)[ Σ_{i=1}^m ε_B(x_i) ]_D are both expressed by a single variational parameter of our model. This parameter can be eliminated to give the following surprisingly simple relation
ε¯t,B 1 − ε¯t,B
(11)
where ε¯t,B = εt,B /σ 2 and ε¯g,B = εg,B /σ 2 . Fig. 2 displays simulation results for Wiener process regression with Gaussian noise of variance σ 2 = 0.01. We used two diﬀerent input distributions p(x) = 1 (squares) and p(x) = 2x (triangles), x ∈ [0, 1]. The number m of example data is indicated by arrows. Eq. (11) is represented by the bold line and holds for suﬃciently large m.
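Relation (11) can be probed numerically: for GP regression the Bayesian posterior variance is ε_B(x) = C(x, x) − k(x)ᵀ(K + σ²I)⁻¹k(x), and averaging it over test inputs and over training inputs gives (single-data-draw estimates of) ε_{g,B} and ε_{t,B}. The following sketch for the Wiener process with uniform inputs is an illustration of the relation, not a substitute for the averaged simulations of Fig. 2:

```python
import numpy as np

sigma2 = 0.01
rng = np.random.default_rng(3)

def wiener_cov(a, b):
    """C(x, x') = min(x, x') for the Wiener process."""
    return np.minimum(a[:, None], b[None, :])

def posterior_var(x_train, x_eval):
    """eps_B(x) = C(x,x) - k(x)^T (K + sigma^2 I)^{-1} k(x); C(x,x) = x here."""
    K = wiener_cov(x_train, x_train) + sigma2 * np.eye(len(x_train))
    k = wiener_cov(x_eval, x_train)              # shape (n_eval, m)
    sol = np.linalg.solve(K, k.T)                # shape (m, n_eval)
    return x_eval - np.einsum('ij,ji->i', k, sol)

m = 400
x_train = rng.uniform(size=m)                    # input density p(x) = 1
x_test = rng.uniform(size=4000)

eps_g = posterior_var(x_train, x_test).mean() / sigma2    # \bar eps_{g,B}
eps_t = posterior_var(x_train, x_train).mean() / sigma2   # \bar eps_{t,B}
predicted = eps_t / (1 - eps_t)                  # right-hand side of Eq. (11)
```

For sufficiently large m the measured ε̄_{g,B} should track ε̄_{t,B}/(1 − ε̄_{t,B}), as the bold line in Fig. 2 indicates.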
7 Future Work
In the future, we will extend our method in the following directions:
– Obviously, our method is not restricted to a regression model but can also be directly generalized to other likelihoods such as the classification case [4,8]. A further application to Support Vector Machines should be possible.
– We will establish further universal relations between different error measures for the more realistic case of a fixed (unknown) function f*(x). It will be interesting to see whether such relations may be useful to construct new methods for model selection, i.e. hyperparameter estimation.
– By computing the influence of the first neglected term in Eq. (6), which is quadratic in H_n − H_n⁰, we will estimate the region in which our approximation is valid.
Acknowledgement This work has been supported by EPSRC grant GR/M81601.
References
1. D. J. C. Mackay. Gaussian Processes: A Replacement for Neural Networks. NIPS tutorial 1997. May be obtained from http://wol.ra.phy.cam.ac.uk/pub/mackay/.
2. C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, eds., 514–520, MIT Press (1996).
3. C. K. I. Williams. Computing with infinite networks. In Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 295–301, MIT Press (1997).
4. D. Barber and C. K. I. Williams. Gaussian processes for Bayesian classification via hybrid Monte Carlo. In Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 340–346, MIT Press (1997).
5. D. Malzahn, M. Opper. Learning curves for Gaussian processes regression: a framework for good approximations. In Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich and V. Tresp, eds., MIT Press (2001), to appear.
6. R. P. Feynman and A. R. Hibbs. Quantum Mechanics and Path Integrals. McGraw-Hill Inc., 1965.
7. P. Sollich. Learning curves for Gaussian processes. In Neural Information Processing Systems 11, M. S. Kearns, S. A. Solla and D. A. Cohn, eds., 344–350, MIT Press (1999).
8. L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther. Efficient approaches to Gaussian process classification. In Neural Information Processing Systems 12, MIT Press (2000).
Tight Bounds on Rates of NeuralNetwork Approximation Vˇera K˚ urkov´ a1 and Marcello Sanguineti2 1
Institute of Computer Science, Academy of Sciences of the Czech Republic Pod Vod´ arenskou vˇeˇz´ı 2, P.O. Box 5 – 182 07, Prague 8, Czech Republic
[email protected] 2 Department of Communications, Computer, and System Sciences (DIST) University of Genoa – Via Opera Pia 13, 16145 Genova, Italy
[email protected]
Abstract. Complexity of neural networks, measured by the number of hidden units, is studied in terms of rates of approximation. Limitations on improvements of upper bounds of the order of O(n^{-1/2}) on such rates are investigated for perceptron networks with some periodic and sigmoidal activation functions.
1 Introduction
Experience has shown that simple neural network architectures with relatively few computational units can achieve surprisingly good performance in a large variety of high-dimensional problems, ranging from strongly nonlinear controllers to identification of complex dynamic systems and pattern recognition (see, e.g., [13], [14]). The efficiency of such neural-network designs has motivated a theoretical analysis of the computational capabilities of neural networks guaranteeing that the number of computational units does not increase too fast with the dimensionality of certain tasks. The dependence of the accuracy of an approximation by feedforward networks on the number of hidden units can be theoretically studied in the context of approximation theory in terms of rates of approximation. Some insight into the reason why many high-dimensional tasks can be performed quite efficiently by neural networks with a moderate number of hidden units has been gained by Jones [4]. The same estimate of rates of approximation had earlier been proven by Maurey using a probabilistic argument (it has been quoted by Pisier [12]; see also Barron [2]). Barron [2] improved Jones's [4] upper bound and applied it to neural networks. Using a weighted Fourier transform, he described sets of multivariable functions approximable by perceptron networks with n hidden units to an accuracy of the order of O(1/√n). Such upper bounds do not depend on
Both authors were partially supported by NATO Grant PST.CLG.976870. V. K. was partially supported by grant GA ČR 201/00/1489. M. S. was partially supported by the Italian Ministry of University and Research (Project "Identification and Control of Industrial Systems") and by Grant D.R. 42 of the University of Genoa.
G. Dorﬀner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 277–282, 2001. c SpringerVerlag Berlin Heidelberg 2001
the number of variables d of the functions to be approximated, in contrast to "curse of dimensionality" rates of the order of O(n^{-1/d}) (d denotes the number of variables), which limit the feasibility of linear approximation methods. However, it should be stressed that as the number of variables increases, the sets of multivariable functions to which such estimates apply become more and more constrained (see [3]), and the constants may depend on d (see [9]). Maurey-Jones-Barron's upper bound is quite general, as it applies to nonlinear approximation of variable-basis type, i.e., approximation by linear combinations of n-tuples of elements of a given set of basis functions. This approximation scheme has been widely investigated: it includes free-nodes splines, free-poles rational functions and feedforward multilayer neural networks with a single linear output unit. Several authors have further improved or extended these upper bounds and investigated limitations of such improvements. Makovoz [10] improved Maurey's probabilistic argument by combining it with a concept from metric entropy theory, which he also used to show that in the case of Lipschitz sigmoidal perceptron networks, the upper bound cannot be improved to O(n^{-α}) for α > 1/2 + 1/d, where d is the number of variables of the functions to be approximated. A similar tightness result for perceptron networks was earlier obtained by Barron [1], who used a more complicated proof technique. For the special case of an orthonormal variable basis, Mhaskar and Micchelli [11] and Kůrková, Savický and Hlaváčková [9] have derived tight improvements of Maurey-Jones-Barron's bound.
In this paper, we extend such tightness results to two types of sets of basis functions: (1) an orthonormal basis, and (2) a basis satisfying certain conditions on polynomial growth of the number of sets of a given diameter needed to cover the basis and having sufficient "capacity", in the sense that its convex balanced hull has an orthogonal subset that for each integer k contains at least k^d functions with norms greater than or equal to 1/k. Condition (1) is satisfied by perceptrons with sine or cosine activations, and condition (2) by perceptrons with Lipschitz sigmoidal activations. We show that in the first case, the Maurey-Jones-Barron upper bound can be improved at most by a factor depending on the ratio between two norms (the norm in which the approximation error is measured and a norm tailored to the set of basis functions), but the term n^{-1/2} remains essentially unchanged (it is only replaced by (1/2)(n − 1)^{-1/2}), and in the second case, the upper bound cannot be improved beyond O(n^{-(1/2 + 1/d)}).
2 Preliminaries
Approximation by feedforward neural networks can be studied in a more general context of approximation by variable-basis functions. In this approximation scheme, elements of a real normed linear space (X, ‖·‖) are approximated by linear combinations of at most n elements of a given subset G. The set of such combinations is denoted by span_n G = {Σ_{i=1}^n w_i g_i ; w_i ∈ R, g_i ∈ G}; it is equal to the union of n-dimensional subspaces generated by all n-tuples of elements of G. G can represent the set of functions computable by hidden units in neural
networks. Recall that a perceptron with an activation function ψ : R → R computes functions of the form ψ(v · x + b), where v ∈ R^d is an input weight vector and b ∈ R is a bias. By P_d(ψ) = {f : [0,1]^d → R; f(x) = ψ(v · x + b), v ∈ R^d, b ∈ R} we denote the set of functions on [0,1]^d computable by ψ-perceptrons. The most common activation functions are sigmoidals, i.e., functions σ : R → [0,1] such that lim_{t→−∞} σ(t) = 0 and lim_{t→∞} σ(t) = 1. A function σ : R → R is Lipschitz if there exists a > 0 such that |σ(t) − σ(t′)| ≤ a|t − t′| for all t, t′ ∈ R.
Rates of approximation of functions from a set Y by functions from a set M can be studied in terms of the worst-case error, formalized by the concept of deviation of Y from M, defined as δ(Y, M) = δ(Y, M, (X, ‖·‖)) = sup_{f∈Y} ‖f − M‖ = sup_{f∈Y} inf_{g∈M} ‖f − g‖. To formulate estimates of deviation from span_n G we need to introduce a few more concepts and notations. If G is a subset of (X, ‖·‖) and c ∈ R, then we define cG = {cg; g ∈ G} and G(c) = {wg; g ∈ G, w ∈ R, |w| ≤ c}. The closure of G is denoted by cl G and defined as cl G = {f ∈ X; (∀ε > 0)(∃g ∈ G)(‖f − g‖ < ε)}. The convex hull of G, denoted by conv G, is the set of all convex combinations of its elements, i.e., conv G = {Σ_{i=1}^n a_i g_i ; a_i ∈ [0,1], Σ_{i=1}^n a_i = 1, g_i ∈ G, n ∈ N_+}. conv_n G denotes the set of all convex combinations of n elements of G, i.e., conv_n G = {Σ_{i=1}^n a_i g_i ; a_i ∈ [0,1], Σ_{i=1}^n a_i = 1, g_i ∈ G}. B_r(‖·‖) and S_r(‖·‖) denote the ball and the sphere, respectively, of radius r with respect to the norm ‖·‖, i.e., B_r(‖·‖) = {x ∈ X; ‖x‖ ≤ r} and S_r(‖·‖) = {x ∈ X; ‖x‖ = r}.
The following estimate is a version of Jones' result as improved by Barron [2], and also of the earlier result of Maurey. Recall that a Hilbert space is a normed linear space with the norm induced by an inner product.
Theorem 1. Let (X, ‖·‖) be a Hilbert space, b a positive real number, G a subset of X such that ‖g‖ ≤ b for every g ∈ G, and let f ∈ cl conv G. Then, for every positive integer n,

‖f − conv_n G‖ ≤ √((b² − ‖f‖²)/n).
As conv_n G(c) ⊂ span_n G(c) = span_n G for any c ∈ R, by replacing the set G with G(c) = {wg; w ∈ R, |w| ≤ c, g ∈ G} we can apply Theorem 1 to all elements of ∪_{c∈R_+} cl conv G(c). This approach can be mathematically formulated in terms of a norm tailored to a set G. Let (X, ‖·‖) be a normed linear space and G its subset; then G-variation (variation with respect to G), denoted by ‖·‖_G, is defined as the Minkowski functional of the set cl conv G(1) = cl conv(G ∪ −G), i.e., ‖f‖_G = inf{c ∈ R_+; f ∈ cl conv G(c)}. G-variation has been introduced by Kůrková [5] as an extension of Barron's [1] concept of variation with respect to half-spaces (more precisely, variation with respect to characteristic functions of half-spaces), corresponding to perceptrons with the discontinuous threshold (Heaviside) activation function. The following theorem is a corollary of Theorem 1 formulated in terms of G-variation (see [5]). Note that for any G, the unit ball in G-variation is equal to cl conv(G ∪ −G).
Theorem 2. Let (X, ‖·‖) be a Hilbert space and G its subset. Then, for every f ∈ X and every positive integer n,

‖f − span_n G‖ ≤ √((s_G² ‖f‖_G² − ‖f‖²)/n) = (‖f‖_G s_G/√n) √(1 − ‖f‖²/(‖f‖_G² s_G²))

and δ(B_1(‖·‖_G), span_n G) ≤ s_G/√n, where s_G = sup_{g∈G} ‖g‖.
3 Tight Bounds for Orthonormal Variable Bases
The results in this section extend to infinite-dimensional spaces tight estimates derived by Kůrková, Savický and Hlaváčková [9] for finite-dimensional spaces as improvements of earlier estimates obtained by Mhaskar and Micchelli [11]. When G is an orthonormal basis of a separable Hilbert space (X, ‖·‖), then G-variation is equal to the l¹ norm with respect to G, denoted by ‖·‖_{1,G} and defined as ‖f‖_{1,G} = Σ_{g∈G} |f · g| (see [8]). The following two tight bounds take advantage of this equality (for their proofs see [8]).
Theorem 3. Let (X, ‖·‖) be an infinite-dimensional separable Hilbert space and G its orthonormal basis. Then, for every positive real number b and every positive integer n, δ(B_b(‖·‖_G), span_n G) = δ(B_b(‖·‖_{1,G}), span_n G) = b/(2√n).
When G is a countably infinite orthonormal basis, Theorem 3 improves the upper bound from Theorem 2 up to an exact value of the deviation of balls in G-variation from span_n G. In contrast to Theorem 2, which expresses an upper bound in terms of both ‖f‖_G and ‖f‖, Theorem 3 does not take ‖f‖ into account. However, even without using the value of ‖f‖, it gives a better bound than Theorem 2 when ‖f‖ is sufficiently small relative to ‖f‖_G (if ‖f‖/‖f‖_G < √3/2, then ‖f‖_G/(2√n) < √((‖f‖_G² − ‖f‖²)/n)). The following theorem gives, for G orthonormal, the maximum possible improvement of Theorem 2, using both ‖·‖_G and ‖·‖.
Theorem 4. Let (X, ‖·‖) be an infinite-dimensional separable Hilbert space, G its orthonormal basis, and b, r real numbers such that 0 ≤ r ≤ b. Then, for every positive integer n ≥ 2:
(i) if r/b ≥ 1/√(2(n−1)), then (b/(4√(n−1))) √(1 − r²/b²) ≤ δ(S_b(‖·‖_{1,G}) ∩ S_r(‖·‖), span_n G) ≤ (b/(2√(n−1))) √(1 − r²/b²);
(ii) if √n − √(n−1) ≤ r/b < 1/√(2(n−1)), then r/(2√2) ≤ δ(S_b(‖·‖_{1,G}) ∩ S_r(‖·‖), span_n G) ≤ (b/(2√(n−1))) √(1 − r²/b²);
(iii) if r/b < √n − √(n−1), then r/(2√2) ≤ δ(S_b(‖·‖_{1,G}) ∩ S_r(‖·‖), span_n G) ≤ r.
Thus, for an orthonormal set G, Maurey-Jones-Barron's bound can at most be improved by a factor depending on the ratio between the two norms ‖·‖
and ‖·‖_G, but the term 1/√n remains essentially unchanged: it is only replaced by 1/(2√(n−1)). Let (L²([0,1]^d), ‖·‖_2) denote the Hilbert space of L² functions on [0,1]^d with the L² norm defined as ‖f‖_2 = (∫_{[0,1]^d} f²(x) dx)^{1/2}. Using Theorems 3 and 4,
we can derive upper bounds on the deviation from perceptron networks with sine or cosine activations, in terms of the Fourier coefficients of functions in (L²([0,1]^d), ‖·‖_2) with respect to a suitable orthonormal basis.
Corollary 5. Let d be a positive integer; then, in (L²([0,1]^d), ‖·‖_2):
(i) for every positive real number b and every positive integer n, δ({f; Σ_{k∈N^d} |f̃(k)| ≤ b}, span_n P_d(sin)) = b/(2√n);
(ii) for every 0 ≤ r ≤ b and every positive integer n ≥ 2, if r/b ≥ √n − √(n−1), then δ({f; Σ_{k∈N^d} |f̃(k)| = b & (Σ_{k∈N^d} f̃(k)²)^{1/2} = r}, span_n P_d(sin)) ≤ (b/(2√(n−1))) √(1 − r²/b²);
(iii) for every 0 ≤ r ≤ b and every positive integer n ≥ 2, if r/b < √n − √(n−1), then δ({f; Σ_{k∈N^d} |f̃(k)| = b & (Σ_{k∈N^d} f̃(k)²)^{1/2} = r}, span_n P_d(sin)) ≤ r.
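A quick numerical illustration of Theorem 3 (our sketch, not part of the paper; Python with NumPy assumed): take G to be the standard orthonormal basis of R^D, so the best approximation from span_n G of a vector f simply keeps its n largest coordinates. The function that spreads its l¹ mass b evenly over 2n coordinates then attains exactly the value b/(2√n).

```python
import numpy as np

def best_n_term_error(f, n):
    """Best approximation of f from span_n of the standard basis:
    keep the n largest coordinates, measure the l2 norm of the rest."""
    idx = np.argsort(np.abs(f))[::-1]
    residual = f.copy()
    residual[idx[:n]] = 0.0
    return np.linalg.norm(residual)

b, n = 3.0, 4
f = np.zeros(32)
f[:2 * n] = b / (2 * n)      # ||f||_{1,G} = b, spread over 2n coordinates
err = best_n_term_error(f, n)
assert np.isclose(err, b / (2 * np.sqrt(n)))
```

Such an f witnesses the lower bound; the nontrivial part of Theorem 3 is that no function in the ball does worse, which is proved in [8].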
4 Tightness of the Bound O(n^{-1/2}) for Sigmoidal Perceptrons
The results in this section extend tightness results derived by Barron [1] and Makovoz [10]. We disprove the possibility of an improvement of Maurey-Jones-Barron's upper bound beyond O(n^{-(1/2 + 1/d)}) for variable bases satisfying certain conditions defined in terms of (i) polynomial growth of the number of sets of a given diameter needed to cover such a basis and (ii) sufficient "capacity" of the basis, in the sense that its convex hull has an orthogonal subset that, for each positive integer k, contains at least k^d functions with norms greater than or equal to 1/k.
Recall that for ε > 0, the ε-covering number of a subset K of a normed linear space (X, ‖·‖) is defined as cov_ε K = cov_ε(K, ‖·‖) = min{n ∈ N_+; K ⊆ ∪_{i=1}^n B_ε(x_i, ‖·‖), x_i ∈ K} if the set over which the minimum is taken is nonempty, and cov_ε(K) = +∞ otherwise. For a positive integer d (corresponding, in the following, to the number of variables of functions in X), we call a subset A of a normed linear space (X, ‖·‖) not quickly vanishing with respect to d if A = ∪_{k∈N_+} A_k, where, for each k ∈ N_+, card A_k ≥ k^d and ‖h‖ ≥ 1/k for each h ∈ A_k.
Theorem 6. Let (X, ‖·‖) be a Hilbert space of functions of d variables and G its bounded subset satisfying the following conditions: (i) there exist a polynomial p(d) and b ∈ R_+ such that, for every ε > 0, cov_ε(G) ≤ b (1/ε)^{p(d)}; (ii) there exists r ∈ R_+ for which B_r(‖·‖_G) contains a set of orthogonal elements which is not quickly vanishing with respect to d. Then δ(B_1(‖·‖_G), conv_n(G ∪ −G)) ≤ O(n^{-α}) implies α ≤ 1/2 + 1/d.
For the proof of Theorem 6 and the verification that its assumptions are satisfied by perceptrons with Lipschitz sigmoidal activation, see [6].
Corollary 7. Let d, n be positive integers and let σ : R → R be a Lipschitz sigmoidal function. Then, in (L²([0,1]^d), ‖·‖_2), δ(B_1(‖·‖_{P_d(σ)}), conv_n(P_d(σ) ∪ −P_d(σ))) ≤ O(n^{-α}) implies α ≤ 1/2 + 1/d.
5 Discussion
We have shown that the upper bounds on worst-case errors following from Maurey-Jones-Barron's theorem cannot be considerably improved for networks with one hidden layer of either Lipschitz sigmoidal or cosine activation functions. Better rates might be achievable using networks with more than one hidden layer, or when the sets of functions to be approximated are more restricted than the sets defined in terms of the two norms considered in this paper.
References
1. Barron, A.R.: Neural net approximation. Proc. 7th Yale Workshop on Adaptive and Learning Systems (K. Narendra, Ed.), pp. 69–72. Yale University Press, 1992.
2. Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. on Information Theory 39, pp. 930–945, 1993.
3. Girosi, F., Jones, M., and Poggio, T.: Regularization theory and neural networks architectures. Neural Computation 7, pp. 219–269, 1995.
4. Jones, L.K.: A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics 20, pp. 608–613, 1992.
5. Kůrková, V.: Dimension-independent rates of approximation by neural networks. In: Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality (K. Warwick, M. Kárný, Eds.). Birkhäuser, Boston, pp. 261–270, 1997.
6. Kůrková, V. and Sanguineti, M.: Covering numbers and rates of neural-network approximation. Research report ICS00830, 2000.
7. Kůrková, V. and Sanguineti, M.: Tools for comparing neural network and linear approximation. IEEE Trans. on Information Theory, 2001 (to appear).
8. Kůrková, V. and Sanguineti, M.: Bounds on rates of variable-basis and neural-network approximation. IEEE Trans. on Information Theory, 2001 (to appear).
9. Kůrková, V., Savický, P., and Hlaváčková, K.: Representations and rates of approximation of real-valued Boolean functions by neural networks. Neural Networks 11, pp. 651–659, 1998.
10. Makovoz, Y.: Random approximants and neural networks. J. of Approximation Theory 85, pp. 98–109, 1996.
11. Mhaskar, H. N. and Micchelli, C. A.: Dimension-independent bounds on the degree of approximation by neural networks. IBM J. of Research and Development 38, n. 3, pp. 277–283, 1994.
12. Pisier, G.: Remarques sur un résultat non publié de B. Maurey. Séminaire d'Analyse Fonctionnelle, vol. I, no. 12. École Polytechnique, Centre de Mathématiques, Palaiseau, 1980–81.
13. Sejnowski, T. J. and Rosenberg, C.: Parallel networks that learn to pronounce English text. Complex Systems 1, pp. 145–168, 1987.
14. Zoppoli, R., Sanguineti, M., and Parisini, T.: Approximating networks and extended Ritz method for the solution of functional optimization problems. J. of Optimization Theory and Applications, 2001 (to appear).
Kernel Methods
Scalable Kernel Systems
Volker Tresp¹ and Anton Schwaighofer¹,²
¹ Siemens AG, Corporate Technology, Otto-Hahn-Ring 6, 81739 München, Germany
{Volker.Tresp,Anton.Schwaighofer.external}@mchp.siemens.de
² TU Graz, Institute for Theoretical Computer Science, Inffeldgasse 16b, 8010 Graz, Austria
[email protected]
http://www.igi.tugraz.ac.at/aschwaig/
Abstract. Kernel-based systems are currently very popular approaches to supervised learning. Unfortunately, the computational load for training kernel-based systems increases drastically with the number of training data points. Recently, a number of approximate methods for scaling kernel-based systems to large data sets have been introduced. In this paper we investigate the relationship between three of those approaches and compare their performances experimentally.
1 Introduction
Kernel-based systems such as the support vector machine (SVM) and Gaussian processes (GP) are powerful and currently very popular approaches to supervised learning. Kernel-based systems have demonstrated very competitive performance on several applications and data sets and have great potential for KDD applications, since their degrees of freedom grow with training data size and they are therefore capable of modeling an increasing amount of detail when appropriately many training data points become available. Unfortunately, there are at least three problems when one tries to scale up these systems to large data sets. First, training time increases drastically with the number of training data points; second, memory requirements increase with data set size; and third, prediction time is proportional to the number of kernels, and the latter is equal to (or at least increases with) the number of training data points. In this presentation, we will concentrate on Gaussian processes, which are the basis for Gaussian process regression, generalized Gaussian process regression, and the support vector machine. We analyze and experimentally compare three recently introduced approaches towards scaling Gaussian processes to large data sets using finite-dimensional representations, thus obtaining learning rules which scale linearly in the number of training data points. The first approach is the subset of representers method (SRM) and can be found in the work of Wahba [5], in the work on sparse greedy Gaussian process regression by Smola and Bartlett [2], and in the reduced support vector machine by Lee and Mangasarian [1]. The SRM is based on a factorization of the kernel
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 285–291, 2001. © Springer-Verlag Berlin Heidelberg 2001
functions. The second variant is a reduced rank approximation (RRA) of the Gram matrix introduced in the work of Williams and Seeger [6]. The RRA uses the same decomposition as the SRM, but this decomposition is only applied to the Gram matrix. The third variant is the BCM approximation introduced by Tresp [3]. Here the starting point is the optimal projection of the data on a set of base kernels, which requires the inversion of a covariance matrix of the size of the number of training data points. The BCM approximation is achieved by a block diagonal approximation of this matrix. In this paper, we analyze the approaches from a common viewpoint and compare the performances of the approximations, paying particular attention to the issue of the optimal scale parameter. The paper is organized as follows. In the next section, we provide a brief introduction to Gaussian processes. In the following sections, we describe the SRM, the RRA and the BCM approximation. Section 6 analyzes the approximations and provides experimental comparisons. Section 7 contains the conclusions.
2 Gaussian Process Regression (GPR)
In GPR one assumes that a priori a function f(x) is generated from an infinite-dimensional Gaussian distribution with zero mean and covariance K(x_i, x_j) = cov(f(x_i), f(x_j)) defined at input points x_i and x_j. Furthermore, we assume a set of training data D = {(x_i, y_i)}_{i=1}^N where targets are generated according to y_i = f(x_i) + ε_i, where ε_i is independent additive Gaussian noise with variance σ². The optimal regression function f̂(x) takes on the form of a weighted combination of kernel functions

f̂(x) = Σ_{i=1}^N w_i K(x, x_i). (1)
Based on our assumptions, the maximum a posteriori (MAP) solution for w = (w_1, . . . , w_N)ᵀ minimizes the cost function

(1/2) wᵀ Σ w + (1/(2σ²)) (Σw − y)ᵀ (Σw − y), (2)

where (Σ)_{i,j} = K(x_i, x_j) is the N × N Gram matrix. The optimal weight vector is the solution to a system of linear equations, which in matrix form becomes

(Σ + σ² I) w = y. (3)
Here y = (y_1, . . . , y_N)ᵀ is the vector of targets and I is the N × N dimensional unit matrix. The experimenter has to specify the positive definite kernel function. A common choice is

K(x_i, x_j) = A exp(−‖x_i − x_j‖² / (2γ²)), (4)

which is a Gaussian with positive amplitude A and scale parameter γ. Other positive definite covariance functions are also used.
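As a concrete sketch, the whole GPR pipeline of Eqs. (1)-(4) is a few lines of linear algebra. The code below is our illustration, not part of the paper (NumPy assumed; the data set and parameter values are invented for the example):

```python
import numpy as np

def gaussian_kernel(X1, X2, A=1.0, gamma=0.5):
    """Eq. (4): K(x_i, x_j) = A * exp(-||x_i - x_j||^2 / (2 gamma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return A * np.exp(-d2 / (2.0 * gamma ** 2))

def gpr_fit(X, y, sigma2):
    """Eq. (3): solve (Sigma + sigma^2 I) w = y for the kernel weights."""
    Sigma = gaussian_kernel(X, X)
    return np.linalg.solve(Sigma + sigma2 * np.eye(len(X)), y)

def gpr_predict(Xstar, X, w):
    """Eq. (1): f(x) = sum_i w_i K(x, x_i)."""
    return gaussian_kernel(Xstar, X) @ w

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50)[:, None]
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(50)
w = gpr_fit(X, y, sigma2=0.01)
f_train = gpr_predict(X, X, w)   # close to the noisy targets
```

Note that the training step solves one dense N × N system, which is exactly the scaling problem the following sections address.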
3 The Subset of Representers Method (SRM)
Here, one first selects a set of N_b base kernels. These base kernels are typically defined at a subset of either the training data or the test data. One can now approximate the covariance at x_i and x_j as

cov(f(x_i), f(x_j)) ≈ (K^b(x_i))ᵀ (Σ^{b,b})^{-1} K^b(x_j). (5)
Here, Σ^{b,b} is the covariance matrix for the base kernel points and K^b(x_i) is the vector of covariances between the functional values at x_i and the base kernel points. Since Σ^{b,b} contains the covariances at the base kernel points, this approximation is an equality if either x_i or x_j is an element of the base kernel points, and is an approximation otherwise. Note that using this approximation, the Gram matrix becomes

Σ ≈ Σ^{m,b} (Σ^{b,b})^{-1} (Σ^{m,b})ᵀ, (6)
where Σ^{m,b} contains the covariance terms between all N training data points and the base kernel points. With this approximation, the rank of the Gram matrix Σ cannot be larger than N_b. The regression function is now a superposition

f̂(x) = Σ_{i=1}^{N_b} w_i K(x, x_i) (7)

of only N_b kernel functions, and the optimal weight vector minimizes the cost function

(1/2) (w^b)ᵀ Σ^{b,b} w^b + (1/(2σ²)) (Σ^{m,b} w^b − y)ᵀ (Σ^{m,b} w^b − y), (8)

where w^b = (w_1, . . . , w_{N_b})ᵀ, and where y is the vector of all training targets. Note that the number of kernels is now N_b instead of N, hence the name subset of representers method (SRM).¹ Usually, the base kernels are selected from the training data set either randomly or using a clustering algorithm [5]. Smola and Bartlett [2] select a (nearly) optimal subset of base kernel points out of the training data set. Their base kernel point selection procedure does not significantly increase the computational complexity of the training procedure.
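Setting the gradient of the cost (8) to zero yields a linear system of size N_b × N_b. The sketch below is ours, not the authors' code (NumPy assumed; the Gaussian kernel and data are illustrative):

```python
import numpy as np

def kernel(X1, X2, gamma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * gamma ** 2))

def srm_fit(X, y, Xb, sigma2):
    """Minimize Eq. (8): the optimal w^b solves
    (sigma^2 S_bb + S_mb^T S_mb) w^b = S_mb^T y."""
    S_bb = kernel(Xb, Xb)          # N_b x N_b
    S_mb = kernel(X, Xb)           # N   x N_b
    A = sigma2 * S_bb + S_mb.T @ S_mb
    return np.linalg.solve(A, S_mb.T @ y)

def srm_predict(Xstar, Xb, wb):
    """Eq. (7): a superposition of only N_b kernels."""
    return kernel(Xstar, Xb) @ wb

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)
Xb = X[::10]                       # N_b = 20 base kernel points
wb = srm_fit(X, y, Xb, sigma2=0.01)
f = srm_predict(X, Xb, wb)
```

Building S_mb^T S_mb costs O(N N_b²) and the solve O(N_b³), so the method scales linearly in N.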
4 A Reduced Rank Approximation (RRA)
In a paper by Williams and Seeger [6], the authors use the decomposition of the Gram matrix of Equation 6 for calculating the kernel weights (Equation 3). Using standard matrix algebra (the Woodbury formula), one obtains

w_opt ≈ (1/σ²) [ y − Σ^{m,b} ((Σ^{m,b})ᵀ Σ^{m,b} + σ² Σ^{b,b})^{-1} (Σ^{m,b})ᵀ y ].¹
¹ Incidentally, the relationship between the full kernel weights and the reduced kernel weights is given by w^b = (Σ^{b,b})^{-1} (Σ^{m,b})ᵀ w. Substitution of this identity in the cost function of Equation 8 and use of Equation 5 leads to the cost function of Equation 2.
In the SRM method, the decomposition of Equation 5 changes the covariance structures of the kernels, whereas here, the covariance structures defining the kernels are unchanged. The factorization of the Gram matrix is used to obtain an efficient approximation of the optimal kernel weights. As a result, in the RRA approximation the number of kernels with nonzero weights is identical to the number of training data points N (Equation 1), whereas in the SRM method, the number of kernels with nonzero weights is identical to the number of base points N_b (Equation 7).
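The Woodbury-based weights are straightforward to implement. The sketch below is ours (NumPy assumed); as a sanity check it uses the fact that when the base set equals the full training set, the factorization of Equation 6 is exact, and the RRA weights coincide with the exact solution of Equation 3:

```python
import numpy as np

def rra_weights(S_mb, S_bb, y, sigma2):
    """w ≈ (1/sigma^2) (y - S_mb (S_mb^T S_mb + sigma^2 S_bb)^{-1} S_mb^T y)."""
    inner = S_mb.T @ S_mb + sigma2 * S_bb
    return (y - S_mb @ np.linalg.solve(inner, S_mb.T @ y)) / sigma2

# Sanity check: base points = all training points makes the Gram matrix
# factorization exact, so the RRA weights equal the exact GP weights.
X = np.linspace(-3, 3, 13)[:, None]
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Sigma = np.exp(-d2 / (2 * 0.3 ** 2))    # well-conditioned Gaussian kernel
y = np.sin(X[:, 0])
sigma2 = 0.1
w_exact = np.linalg.solve(Sigma + sigma2 * np.eye(13), y)
w_rra = rra_weights(Sigma, Sigma, y, sigma2)
assert np.allclose(w_rra, w_exact)
```

In the practical regime N_b ≪ N the inner solve is only N_b × N_b, again giving an O(N N_b²) method, but the weight vector remains of length N.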
5 BCM Approximation
The Bayesian committee machine (BCM) was introduced by Tresp [3] and was derived using assumptions about conditional independencies. Here, we will choose a new approach to derive the BCM approximation. Let P(f^b) = G(f^b; 0, Σ^{b,b}) be the Gaussian prior distribution of the unknown functional values at the base kernel points. Furthermore, let P(y|f^b) = G(y; Σ^{m,b} w^b, cov(y|f^b)) be the conditional density of the training targets given f^b. Here, w^b is the weight vector defined on the base kernels, and

cov(y|f^b) = σ² I + Σ − Σ^{m,b} (Σ^{b,b})^{-1} (Σ^{m,b})ᵀ (9)

is the covariance of the training data given f^b. Note that both equations define a joint probability model and allow the calculation of many quantities of interest, e.g. E(f^b|y). To be able to compare the BCM and the SRM, we will use the identity

f^b = Σ^{b,b} w^b. (10)

The optimal w^b then minimizes the cost function

(1/2) (w^b)ᵀ Σ^{b,b} w^b + (1/2) (Σ^{m,b} w^b − y)ᵀ cov(y|f^b)^{-1} (Σ^{m,b} w^b − y). (11)

Note that the errors in the likelihood term are correlated. Equations 10 and 11 can be used to calculate the optimal prediction at the base kernel points, but this requires the calculation of the inverse of cov(y|f^b), and the latter has dimension N × N. The BCM uses a block diagonal approximation of cov(y|f^b), and the calculation of the weight vector w^b then requires the inversion of matrices of only the block size B. The BCM approximation improves if few blocks are used (then a smaller number of elements are set to zero) and when N_b is large, since then the last two terms on the right side of Equation (9) tend to cancel and cov(y|f^b) ≈ σ² I. Note that the BCM approximation becomes
the SRM if we set cov(y|f^b) = σ² I. In the latter, the induced correlations are completely ignored. With the BCM approximation we obtain

w^b_opt ≈ [ Σ^{b,b} + Σ_{i=1}^M (Σ_i^{m,b})ᵀ cov(y_i|f^b)^{-1} Σ_i^{m,b} ]^{-1} Σ_{i=1}^M (Σ_i^{m,b})ᵀ cov(y_i|f^b)^{-1} y_i,

which is one particular form of the BCM approximation. Here, M is the number of blocks, y_i is the vector of targets of the i-th module, cov(y_i|f^b) is the i-th diagonal block of cov(y|f^b), and Σ_i^{m,b} is the submatrix of Σ^{m,b} containing the covariances between the base kernel points and the training data points in the i-th partition. The predictions at the base kernel points can be obtained by substituting w^b_opt in Equation 10. The predictions at additional test points can be calculated by substituting w^b_opt in Equation 7.
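A sketch of this block-wise computation (our code, NumPy assumed; the kernel and partition are illustrative). Inside the loop, only matrices of block size are inverted; as a degenerate sanity check, with a single block and the base set equal to the training set, cov(y|f^b) reduces to σ² I and the BCM weights coincide with the exact GP weights:

```python
import numpy as np

def kernel(X1, X2, gamma=0.3):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * gamma ** 2))

def bcm_weights(X, y, Xb, sigma2, n_blocks):
    """w^b = (S_bb + sum_i S_ib^T C_i^{-1} S_ib)^{-1} sum_i S_ib^T C_i^{-1} y_i,
    with C_i the i-th diagonal block of cov(y | f^b), Eq. (9)."""
    S_bb = kernel(Xb, Xb)
    A = S_bb.copy()
    rhs = np.zeros(len(Xb))
    for Xi, yi in zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks)):
        S_ib = kernel(Xi, Xb)                       # Sigma_i^{m,b}
        C_i = sigma2 * np.eye(len(Xi)) + kernel(Xi, Xi) \
            - S_ib @ np.linalg.solve(S_bb, S_ib.T)  # block of Eq. (9)
        A += S_ib.T @ np.linalg.solve(C_i, S_ib)
        rhs += S_ib.T @ np.linalg.solve(C_i, yi)
    return np.linalg.solve(A, rhs)

# Degenerate check: one block, base set = training set.
X = np.linspace(-3, 3, 13)[:, None]
y = np.sin(X[:, 0])
wb = bcm_weights(X, y, X, sigma2=0.1, n_blocks=1)
w_exact = np.linalg.solve(kernel(X, X) + 0.1 * np.eye(13), y)
assert np.allclose(wb, w_exact, atol=1e-6)
```

With M blocks of size B = N/M, each iteration costs O(B³ + B N_b²), so the overall scaling again stays linear in N for fixed B and N_b.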
6 Experimental Comparisons
In the first experiment, the base points were selected out of the test set. This corresponds to a procedure sometimes referred to as transduction, where the user starts training the system only after the inputs of the test points become available. The experimental results presented in Figure 1 (A), (B) show that for the optimal scale parameters, the BCM approximation makes significantly better predictions at the base kernel points than the SRM. For large scale parameters, cov(y|f^b) is approximately diagonal and both approximations give comparable results. For smaller scale parameters, the block diagonal approximation is better than the diagonal approximation and the BCM gives better results than the SRM. The predictions of the RRA at the base points are identical to the predictions of the SRM method. Based on the predictions at the base points, one can calculate predictions at additional test points. The results shown in Figure 1 (C) show that the results of the BCM and the SRM method are comparable, although the latter gives slightly better results.² Figure 1 (D) shows the test set error if the standard procedure is used. Here, the base points are randomly chosen out of the training data set, weights on the base points in the training data set are calculated, and these are then used to predict at the test points. As expected, the results shown in (D) are comparable to the results in (C). For the experiments in (C) and (D), the results of the RRA (not shown) were considerably worse than the results obtained using the BCM approximation and the SRM method.
7 Conclusions
In this paper, we have compared three approaches for scaling up kernelbased systems. The computational complexity of the presented methods scales as O(N × 2
² Of course one could calculate the BCM approximation again for the additional test points instead of using Equation 1. This would give better results but would require another O(Nm²) operations.
Fig. 1. Test set error is plotted against the scale parameter (width γ) of a Gaussian kernel for the BCM (continuous) and for the SRM (dashed). For (A), (B), and (C), the base points were randomly selected out of the test set. (A) and (B) show the performance at the base points and (C) shows the performance at additional test points. For the experiment in (D), the base points were randomly selected out of the training data set and the error on an independent test set is shown. We used 10000 training data points, 1000 base kernel points and 1000 additional test points. The plots are based on an artificial data set with additive noise with variance σ² = 0 (A, C, D) and σ² = 0.001 (B). The test data are noise free.
N_b²), where N is the number of training data points and N_b is the number of base kernel points. If training is performed after the test inputs are known (transduction), the BCM outperforms the other approaches. In the more common setting, where training is done before the inputs to the test set are available (induction), all three methods perform comparably, although the subset of representers method seems to have a slight advantage in performance.
References 1. Lee, Y.J. and Mangasarian, O. L.: RSVM: Reduced Support Vector Machines. Data Mining Institute Technical Report 0007, Computer Sciences Department, University of Wisconsin (2000)
2. Smola, A. J. and Bartlett, P.: Sparse Greedy Gaussian Process Regression. In: T. K. Leen, T. G. Dietterich and V. Tresp (eds.): Advances in Neural Information Processing Systems 13 (2001)
3. Tresp, V.: The Bayesian Committee Machine. Neural Computation, Vol. 12 (2000)
4. Tresp, V.: Scaling Kernel-Based Systems to Large Data Sets. Data Mining and Knowledge Discovery, accepted for publication
5. Wahba, G.: Spline Models for Observational Data. Philadelphia: Society for Industrial and Applied Mathematics (1990)
6. Williams, C. K. I. and Seeger, M.: Using the Nyström Method to Speed up Kernel Machines. In: T. K. Leen, T. G. Dietterich and V. Tresp (eds.): Advances in Neural Information Processing Systems 13 (2001)
On-Line Learning Methods for Gaussian Processes
Shigeyuki Oba¹, Masaaki Sato², and Shin Ishii¹
¹ Nara Institute of Science and Technology, Takayama 8916-5, Ikoma, Japan
[email protected]
http://mimi.aist-nara.ac.jp/~shigeo/
² ATR International, Soraku-gun, Kyoto, Japan
Abstract. This article proposes two modifications of Gaussian processes, which aim to deal with dynamic environments. One is a weight decay method that gradually forgets the old data, and the other is a time stamp method that regards the time course of data as a Gaussian process. We show experimental results when these modifications are applied to regression problems in dynamic environments.
1 Introduction
Gaussian processes (GPs) [1,2,3] provide a natural Bayesian framework for regression problems. Especially when the data involve large Gaussian noise [4], the generalization performance of GP regression exceeds that of various other function approximators. This article considers situations where data are provided to the learning system in an online manner:

– At each observation, only one or a small number of data points are added to D_t, the data storage at time t.
– A prediction p(y_* | x_*, D_t) for a given input x_* is made occasionally, based on the current data storage.
– The environment generating the data changes over time. Typically, the input-output relationship changes, or the input distribution changes.

In such a situation, the learning system should rapidly adapt to the current environment by putting its focus on the recently observed data. This article discusses a couple of GP modifications that realize such online adaptation.
2 GP Regression
There is a dataset consisting of N data points: D = (X, y), X = {x_n ∈ R^d | n = 1, ..., N}, y = {y_n ∈ R | n = 1, ..., N}. A regression problem aims at obtaining a predictive function y = f(x) from the dataset. Bayesian inference for this problem is formulated as

    p(f(x) | D) = p(y | f(x), X) p(f(x)) / p(y | X).    (1)

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 292–299, 2001.
© Springer-Verlag Berlin Heidelberg 2001
In many parametric models, such as the MLP or RBF, the predictive function is parameterized by a parameter w and the problem is formulated as a Bayesian inference of the posterior distribution of the parameter, p(w | D). A Gaussian process (GP) approach, on the other hand, places a prior distribution on the prediction function, p(f(x)). Instead of directly using the prior distribution of the prediction function, we can equivalently use the prior distribution of the outputs f = (f(x_1), ..., f(x_N)) for the finite input set X = {x_1, ..., x_N}, which is represented by a zero-mean Gaussian distribution with covariance C:

    p(f(x_1), ..., f(x_N) | C, X) = (1/Z_y) exp( -(1/2) f^T C^{-1} f ),    (2)

where Z_y is a normalization term. The covariance matrix C is a positive definite function of X. Each component is defined by

    C_ij = C(x_i, x_j) + δ_ij N(x_i),    (3)
where C(x_i, x_j) is a covariance function which defines the relation between two input points, and N(x_i) is a noise function which defines the output noise at the individual input x_i. We assume C(x_i, x_j) and N(x_i) are parameterized by a parameter θ. Under condition (2), the conditional probability distribution of the output y_* = f(x_*) for a new input x_* is given by

    p(y_* | x_*, X, y, θ) = p((y_*, y) | (x_*, X), θ) / p(y | X, θ)
                          = (1/Z_*) exp( -(1/2) [ (y_*, y^T) C_*^{-1} (y_*, y^T)^T - y^T C^{-1} y ] ),    (4)
where Z_* is a normalization term and C_* is the covariance matrix for (X, x_*). C_* is given by

    C_* = [ C  k ; k^T  κ ],  k = (C(x_1, x_*), ..., C(x_N, x_*))^T,  κ = C(x_*, x_*) + N(x_*).    (5)

From (4) and (5), we obtain the predictive Gaussian distribution as

    p(y_* | x_*, (X, y), θ) = (1/Z) exp( -(1/2) (y_* - E[y_*])^2 / var[y_*] ),    (6)

    E[y_*] = k^T C^{-1} y,    var[y_*] = κ - k^T C^{-1} k.    (7)
Equation (3) means that the covariance matrix is the summation of the covariance function and the noise function. Any functional form is allowed for the covariance function C(x_i, x_j) as long as the covariance matrix is positive definite. When the covariance function

    C(x_i, x_j; θ) = v exp( -(1/2) Σ_{l=1}^{d} α_l (x_{i,l} - x_{j,l})^2 )    (8)
is used, the GP can be regarded as an RBF network with an infinite number of kernel functions [1]. Here, l is an index of the input dimension and x_{i,l} denotes the l-th component of input x_i. Parameter v represents the vertical scale of the supposed GP. α_l represents the inverse variance of the l-th component and also indicates the contribution of the l-th component. The simplest noise function is the constant N = β, corresponding to input-independent noise. An input-dependent noise can be defined by a noise function like

    N(x) = exp( Σ_{j=1}^{J} (log β_j) φ_j(x) ),

where φ_j(x) is a kernel function. Under the above parameterization, the likelihood of dataset D is given by

    P(D | θ) = (1 / sqrt((2π)^N |C|)) exp( -(1/2) y^T C^{-1} y ).    (9)

Using (9), parameter θ can be obtained by maximum likelihood (ML) estimation or Bayesian estimation. A gradient ascent algorithm on the log-likelihood, i.e., ML estimation, is used in our study. Parameter θ is updated after observing each datum.
3 Weight Decay Method
We are interested in a situation where the environment changes over time. In order to deal with such a dynamic environment, the learning system needs to focus on the recent data by gradually forgetting the old data. This section discusses a method that decreases data weights. A weight value indicates the priority of the corresponding datum.

3.1 Covariance Matrix and Overlapping Data
In order to deal with weighted data, we first consider a situation where there are w (> 1) identical data in the dataset. The following theorem states how to deal with such overlapping data.

Theorem 1. Overlapping w data are equivalent to a single datum whose noise function N(x) is divided by w.

Proof. We assume that there are N + w data points: (x_1, y_1), ..., (x_N, y_N), (x_{N+1}, y_{N+1}), ..., (x_{N+w}, y_{N+w}), and the last w data points are identical; namely, x_{N+i} = x_0 and y_{N+i} = y_0 for i = 1, ..., w. The covariance matrix for the N + w data points, C_+, is given by

    C_+ = [ C  K ; K^T  A ].    (10)

A = N(x_0) I_w + κ J_w is a w-by-w matrix, where κ = C(x_0, x_0), I_w is the w-by-w unit matrix and J_w is the w-by-w matrix whose components are all 1. K =
[k, ..., k] is an N-by-w matrix consisting of the N-dimensional column vector k = [C(x_1, x_0), ..., C(x_N, x_0)]^T. Using partial inverse matrix expansion, the inverse of C_+ is calculated as

    C_+^{-1} = [ C~  K~ ; K~^T  A~ ],    (11)

    A~ := (N(x_0) I_w + κ J_w - K^T C^{-1} K)^{-1},
    K~ := -C^{-1} K A~,
    C~ := C^{-1} + C^{-1} K A~ K^T C^{-1}.

Using K^T C^{-1} K = (k^T C^{-1} k) J_w and (a I_w + b J_w)^{-1} = (1/a) I_w - (b / (a(wb + a))) J_w, we obtain

    A~ = (1/N(x_0)) I_w - ( (κ - k^T C^{-1} k) / ( N(x_0) { w(κ - k^T C^{-1} k) + N(x_0) } ) ) J_w.    (12)

Furthermore, using K I_w = K, K J_w = wK and K K^T = w k k^T, we obtain

    K~ = -(λ/w) C^{-1} K,    C~ = C^{-1} + λ C^{-1} k k^T C^{-1},    (13)

    λ := 1 / ( κ - k^T C^{-1} k + N(x_0)/w ).    (14)
For a new input x_*, the mean output E[y_*] is given as follows using the covariance matrix C_+:

    E[y_*] = [k_*^T  l_*^T] C_+^{-1} [y_N ; y_w],    (15)

where k_* = [C(x_1, x_*), ..., C(x_N, x_*)]^T, l_* = [l_0, ..., l_0]^T is a w-dimensional vector consisting of l_0 = C(x_0, x_*), y_N = [y_1, ..., y_N]^T, and y_w = [y_0, ..., y_0]^T is a w-dimensional vector of y_0. From (11), we obtain

    E[y_*] = k_*^T C~ y_N + k_*^T K~ y_w + l_*^T K~^T y_N + l_*^T A~ y_w
           = k_*^T C~ y_N - λ ( y_0 k_*^T C^{-1} k + l_0 k^T C^{-1} y_N - l_0 y_0 )
           = [k_*^T  l_0] [ C~  -λ C^{-1} k ; -λ k^T C^{-1}  λ ] [y_N ; y_0].    (16)

Similarly, var[y_*] is given by

    var[y_*] = κ - [k_*^T  l_0] [ C~  -λ C^{-1} k ; -λ k^T C^{-1}  λ ] [k_* ; l_0].    (17)
Equations (16) and (17) mean that w identical data can be replaced in the prediction of y∗ by a single datum. From equation (14), however, we see that the noise function for the single datum is divided by w.
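Theorem 1 is easy to check numerically. The sketch below (illustrative data, assuming an RBF covariance with constant noise β) compares appending w identical copies of a datum against appending a single copy whose noise is β/w:

```python
import numpy as np

def rbf(a, b, v=1.0, alpha=10.0):
    return v * np.exp(-0.5 * alpha * (a[:, None] - b[None, :]) ** 2)

def predict(X, y, noise, x_star):
    """Predictive mean/variance of Eqs. (6)-(7) with per-datum noise."""
    C = rbf(X, X) + np.diag(noise)
    k = rbf(X, np.array([x_star]))[:, 0]
    kappa = 1.0                         # C(x*, x*); no test-point noise added
    mean = k @ np.linalg.solve(C, y)
    var = kappa - k @ np.linalg.solve(C, k)
    return mean, var

beta, w = 0.2, 4
X = np.array([-0.5, 0.3])
y = np.array([0.1, -0.4])

# (a) w identical copies of (x0, y0) = (0.8, 0.7), each with noise beta
Xa = np.concatenate([X, np.full(w, 0.8)])
ya = np.concatenate([y, np.full(w, 0.7)])
na = np.full(len(Xa), beta)
# (b) one copy of (x0, y0) with noise beta / w
Xb = np.concatenate([X, [0.8]])
yb = np.concatenate([y, [0.7]])
nb = np.concatenate([np.full(2, beta), [beta / w]])

ma, va = predict(Xa, ya, na, 0.0)
mb, vb = predict(Xb, yb, nb, 0.0)
print(abs(ma - mb), abs(va - vb))  # both differences vanish up to rounding
```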
Fig. 1. Thick and thin circles denote data with weights 1.0 and 0.1, respectively. The dashed line denotes the expected output, and the upper and lower solid lines denote the error bar (expected standard deviation).
The above theorem can be straightforwardly extended to the case where each datum has a weight value. When there are N weighted data (x_1, y_1, w_1), ..., (x_N, y_N, w_N), the covariance matrix is given by

    C_ij = C(x_i, x_j) + δ_ij N(x_i) / w_i.    (18)
Each weight value w_k (k = 1, ..., N) is no longer necessarily an integer. Figure 1 shows an experimental result of the GP regression when some data have a small weight value (0.1) while the other data have the regular weight value (1.0).

3.2 On-Line GP Regression with Weight Decay
In the weight decay method proposed here, the data weights are gradually decreased in order to forget the old data. At time t, a single datum (x_t, y_t) is provided to the learning system with an initial weight value of 1.0. At each subsequent time step, the weight is multiplied by a decay factor η; namely, the weight decays exponentially. When a datum is too old and its weight value becomes sufficiently small, the corresponding entry is deleted from the covariance matrix. This can be done by applying the partial matrix inverse calculation in the reverse direction. Figure 2 shows an example. The target function is y = sin(8x) for 0 < t ≤ 60 and y = −sin(8x) for 60 < t ≤ 120. At each time step t, an input x_t is randomly generated from a Gaussian distribution whose center is −0.5 (for 0 < t ≤ 30 and 60 < t ≤ 90) or 0.5 (for 30 < t ≤ 60 and 90 < t ≤ 120). For each input x_t, the target output y_t is generated by the target function disturbed by Gaussian noise. Figure 2(a) shows the whole dataset. Figure 2(b) shows the regression results at t = 30, 60, 90 and 120. The results for t = 30 and t = 60 show that the GP regression approximates the target function well in the regions where data are observed. At t = 90, although the newly observed data in the x < 0 region contradict the older data, the regression adapts to the new data. At t = 120, the regression approximates the new target function well.
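The weight bookkeeping described above can be sketched as follows; η and the deletion threshold are illustrative values, and the covariance-matrix updates (partial matrix inversion) are omitted:

```python
import numpy as np

def weight_decay_step(data, weights, new_datum, eta=0.9, threshold=1e-3):
    """One time step: decay all weights, add the new datum, drop stale entries."""
    weights = [w * eta for w in weights]                  # exponential decay
    data, weights = data + [new_datum], weights + [1.0]   # new datum enters with weight 1.0
    kept = [(d, w) for d, w in zip(data, weights) if w >= threshold]
    return [d for d, _ in kept], [w for _, w in kept]

rng = np.random.default_rng(0)
data, weights = [], []
for t in range(120):
    x = rng.normal()
    data, weights = weight_decay_step(data, weights, (x, np.sin(8 * x)))
print(len(data))  # only data newer than ~66 steps survive, since 0.9**66 < 1e-3
```

Keeping only the surviving entries is what bounds the size of the covariance matrix in the online setting.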
Fig. 2. (a) Target functions and data. (b) The results of the GP regression with weight decay. Data marked by ‘∗’ denote the most recent 30 data, ‘+’ the previous 30 data, and ‘·’ the older data. Each gray line shows the target function at that time, the dashed line is the predicted function, and the two thin black lines denote the error bar.
4 Time Stamp Method
In a dynamic environment, predictions based on data that are too old will be inaccurate. Therefore, the prediction should consider the time interval between the data observation time and the prediction time. One way to do so is to let a datum explicitly involve its observation time as a component. Let a datum at time t = t_i be represented by its input, output and observation time, (x_i, y_i, t_i). For the augmented dataset, the covariance function is given by

    C((x_i, t_i), (x_j, t_j)) = v exp( -(1/2) Σ_{l=1}^{d} α_l (x_{i,l} - x_{j,l})^2 - (1/2) α_t (t_i - t_j)^2 ).    (19)
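For a scalar input, the covariance (19) can be sketched as below (hyperparameter values are illustrative); two observations at the same location but distant times become weakly coupled, so old data influence the current prediction less:

```python
import numpy as np

def time_stamp_cov(x_i, t_i, x_j, t_j, v=1.0, alpha_x=10.0, alpha_t=0.5):
    """Covariance of Eq. (19) for one input dimension plus a time stamp."""
    return v * np.exp(-0.5 * alpha_x * (x_i - x_j) ** 2
                      - 0.5 * alpha_t * (t_i - t_j) ** 2)

near = time_stamp_cov(0.2, 100.0, 0.2, 99.0)   # same location, 1 step apart
far = time_stamp_cov(0.2, 100.0, 0.2, 10.0)    # same location, 90 steps apart
print(near, far)  # the recent pair is far more strongly coupled
```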
The learning of the new parameter αt is done similarly to that in the ordinary GP regression. This simple method is called the time stamp method. An experimental result is shown in Figure 3. The target functions and the dataset are the same as those in the previous experiment. Figure 3(b) shows how this method works. Before the target function changes (t ≤ 60), parameter αt corresponding to the time variance (‘time scale’) decreases as the learning proceeds, implying that the GP regression regards the environment as static.
Fig. 3. (a) GP regression results by the time stamp method. (b) Time course of the parameters. The ordinate is in log scale. The thick, dashed, chained, and thin lines denote the parameters α_t (time scale), α_1 (x scale), v (y scale), and β (noise scale), respectively.
Fig. 4. (a) Target function smoothly changes over time. (b) Result of the weight decay method at t = 120. (c) Result of the time stamp method at t = 120.
After the target function suddenly changes at t = 60, the GP regression comes to regard the environment as noisy; this can be seen in the temporary increase of v and β, which correspond to the output variance (‘y scale’ and ‘noise scale’). However, what actually happens is a change of the environment rather than an increase of noise. Therefore, the GP regression increases α_t while decreasing v and β; namely, the system forgets the old data more easily and puts more focus on the recent data. This increase of α_t has a side effect. At t = 90 and t = 120 in Figure 3(a), the error bar for the region where the previous data (mark ‘+’) were provided is fairly large. This is due to the increase of the time scale. Since the time stamp method handles the observation time as a smooth GP, just like the input, it is especially advantageous when the target function changes smoothly over time. Figure 4 shows an experimental example. While the weight decay method only follows the recent environment, the time stamp method can predict future change by taking the past environmental changes into account. This improves the prediction under smooth environmental change. Indeed, the final prediction at t = 120 is much better with the time stamp method than with the weight decay method.
5 Concluding Remark
This article proposed two learning methods for GPs, i.e., the weight decay method and the time stamp method, aiming at online adaptation to environmental changes. Naturally, these two methods can be combined in order to deal with complex environmental changes. On the other hand, online learning problems often involve an increasingly large number of samples. Conventional GP algorithms suffer from computational costs that grow quadratically or cubically with the number of samples. Since our weight decay method employs a heuristic that deletes data whose weight values are sufficiently small, we expect that it can be applied to problems with a fairly large number of data. Actual applications to real problems are left for future work. Recently, GP modifications using sparsely allocated kernels have been proposed in order to deal with large data sets (e.g., [6]). Since our idea is to manipulate kernel hyperparameters in an online manner, our methods can be combined with such GP modifications.
References
1. MacKay, D. J. C.: Gaussian processes – a replacement for supervised neural networks? Lecture notes for a tutorial at NIPS (1997)
2. Gibbs, M. N. and MacKay, D. J. C.: Efficient implementation of Gaussian processes. Draft manuscript, Cavendish Laboratory, Cambridge, UK (1997), available from http://wol.ra.phy.cam.ac.uk/mackay/homepage.html
3. Williams, C. K. I.: Prediction with Gaussian processes: from linear regression to linear prediction and beyond. Technical Report NCRG/97/012, Neural Computing Research Group, Department of Computer Science, Aston University (1997)
4. Rasmussen, C. E.: Evaluation of Gaussian processes and other methods for non-linear regression. PhD Thesis, Department of Computer Science, University of Toronto, Toronto, Canada. http://www.cs.toronto.edu/pub/carl/thesis.ps.gz
5. Neal, R. M.: Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report CRG-TR-97-2, Department of Computer Science, University of Toronto (1997)
6. Csató, L. and Opper, M.: Sparse Representation for Gaussian Process Models. To appear in Advances in Neural Information Processing Systems 13 (eds. T. K. Leen, T. G. Dietterich, V. Tresp). MIT Press (2001)
Online Approximations for Wind-Field Models

Lehel Csató, Dan Cornford, and Manfred Opper

Neural Computing Research Group, Aston University, B4 7ET Birmingham, United Kingdom
{csatol,cornfosd,opperm}@aston.ac.uk
Abstract. We study online approximations to Gaussian process models for spatially distributed systems. We apply our method to the prediction of wind fields over the ocean surface from scatterometer data. Our approach combines a sequential update of a Gaussian approximation to the posterior with a sparse representation that allows us to treat problems with a large number of observations.
1 Introduction
A common scenario for applying online or sequential learning methods [Saad 1998] is when the amount of data is too large to be processed by more efficient offline methods, or when there is no possibility of storing the arriving data. In this article we consider the area of spatial statistics [Cressie 1991], where the data are observed at different spatial locations and the aim is to build a global Bayesian model of the local observations based on a Gaussian process prior distribution. Specifically, we consider scatterometer data obtained from the ERS-2 satellite [Offiler 1994], where the aim is to obtain an estimate of the wind fields which the scatterometer indirectly measured. The scatterometer measures the radar backscatter from the ocean surface at a wavelength of approximately 5 cm. The strength of the returned signal gives an indication of the wind speed and direction, relative to the scatterometer beam direction. As shown in [Stoffelen and Anderson 1997b], the measured backscatter behaves as a truncated Fourier expansion in relative wind direction. Thus, while the map from wind vectors to scatterometer observations is one-to-one, its inverse is one-to-many [Evans et al. 2000]. This makes the retrieval of a wind field a complex problem with multiple solutions. Nabney et al. [2000] have recently proposed a Bayesian framework for wind field retrieval combining a vector Gaussian process prior model with local forward (wind field to scatterometer) or inverse models. One problem with the approach outlined in [Nabney et al. 2000] is that the vector Gaussian process requires a matrix inversion which scales as n^3. The backscatter is measured over 50 × 50 km cells over the ocean, and the total number of observations acquired on a given orbit can be several thousand. In this paper we show that we can produce an efficient approximation to the posterior distribution of the wind field by applying a Bayesian online learning approach [Opper 1998] to Gaussian process models following [Csató and Opper 2001], which computes the approximate posterior by a single sweep through the data. The computational complexity is further reduced by constructing a sparse sequential approximate representation to the posterior process.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 300–307, 2001.
© Springer-Verlag Berlin Heidelberg 2001
2 Processing Scatterometer Data
Scatterometers are commonly used to retrieve wind vectors over ocean surfaces. Current methods for transforming the observed values (scatterometer data, denoted as a vector s, or s_i at a given spatial location) into wind fields can be split into two phases: local wind vector retrieval and ambiguity removal [Stoffelen and Anderson 1997a], where one of the local solutions is selected as the true wind vector. Ambiguity removal often uses external information, such as a Numerical Weather Prediction (NWP) forecast of the expected wind field at the time of the scatterometer observations. We are seeking a method of wind field retrieval which does not require external data. In this paper we use a mixture density network (MDN) [Bishop 1995] to model the conditional dependence of the local wind vector z_i = (u_i, v_i) on the local scatterometer observations s_i:

    p_m(z_i | s_i, ω) = Σ_{j=1}^{4} β_ij φ(z_i | c_ij, σ_ij)    (1)
where ω denotes the parameters of the MDN and φ is a Gaussian distribution whose parameters are functions of ω and s_i. The parameters of the MDN are determined using an independent training set [Evans et al. 2000] and are considered known in this application. The MDN, which has four Gaussian component densities, captures the ambiguity of the inverse problem. In order to have a global model from the localised wind vectors, we have to combine them. We use a zero-mean vector GP to link the local inverse models [Nabney et al. 2000]:

    q(z) ∝ p_0(z | W_0) Π_{i=1}^{N} [ p_m(z_i | s_i, ω) p(s_i) / p_G(z_i | W_0i) ]    (2)

where z = [z_1, ..., z_N]^T is the concatenation of the local wind field components, W_0 = {W_0(x_i, x_j)}_{i,j=1,...,N} is the prior covariance matrix for the vector z (dependent on the spatial locations of the wind vectors), and p_G is p_0 marginalised at z_i, a zero-mean Gaussian with covariance W_0i. The choice of the kernel function W_0(x, y) fully specifies our prior beliefs about the model. Notice also that for any given location we have a two-dimensional wind vector, thus the output of the kernel function is a 2×2 matrix; details can be found in [Nabney et al. 2000]. The link between two different wind field directions is made through the kernel function – the larger the kernel value, the stronger the “coupling” between the two corresponding wind fields. The prior Gaussian process is tuned carefully to represent features seen in real wind fields.
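As a toy illustration of the local inverse model (1), the mixture density can be evaluated as follows, assuming an isotropic Gaussian kernel φ and made-up (untrained) parameters:

```python
import numpy as np

def mdn_density(z, beta, centres, sigmas):
    """p_m(z | s, omega) of Eq. (1) for a 2-d wind vector z."""
    total = 0.0
    for b, c, s in zip(beta, centres, sigmas):
        norm = 1.0 / (2.0 * np.pi * s ** 2)               # 2-d isotropic Gaussian
        total += b * norm * np.exp(-np.sum((z - c) ** 2) / (2.0 * s ** 2))
    return total

beta = np.array([0.4, 0.3, 0.2, 0.1])                     # mixing coefficients, sum to 1
centres = np.array([[3.0, 1.0], [-3.0, -1.0], [1.0, -3.0], [-1.0, 3.0]])
sigmas = np.array([0.8, 0.8, 1.0, 1.0])
p_near = mdn_density(np.array([2.9, 1.1]), beta, centres, sigmas)
p_far = mdn_density(np.array([20.0, 20.0]), beta, centres, sigmas)
print(p_near, p_far)  # the density concentrates around the four ambiguous modes
```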
Since all quantities involved are Gaussians, we could, in principle, compute the resulting probabilities analytically, but this computation is practically intractable: the number of mixture elements in q(z) is 4^N, extremely high even for moderate values of N. Instead, we apply the online approximation of [Csató and Opper 2001] to obtain a jointly Gaussian approximation to the posterior at all data points. However, we know that the posterior distribution of the wind field given the scatterometer observations is multimodal, with, in general, two dominating and well-separated modes. We might thus expect that the online implementation of the Gaussian process will track one of these posterior modes. Results show that this is indeed the case, although the order of insertion of the local observations appears to be important.
3 Online Learning for the Vector Gaussian Process
Gaussian processes belong to the family of Bayesian models [Bernardo and Smith 1994]. However, contrary to the finite-dimensional case, here the “model parameters” are continuous: the GP priors specify a Gaussian distribution over a function space. Due to the vector GP, the kernel function W_0(x, y) is a 2 × 2 matrix, specifying the pairwise cross-correlation between wind field components at different spatial positions. Simple moments of GP posteriors (which are usually non-Gaussian) have a parametrisation in terms of the training data [Opper and Winther 1999] which resembles the popular kernel representation [Kimeldorf and Wahba 1971]. For all spatial locations x, the mean and covariance function of the vectors z_x ∈ R^2 are represented as

    <z_x> = Σ_{i=1}^{N} W_0(x, x_i) · α_z(i)
    cov(z_x, z_y) = W_0(x, y) + Σ_{i,j=1}^{N} W_0(x, x_i) · C_z(ij) · W_0(x_j, y)    (3)

where α_z(1), α_z(2), ..., α_z(N) and {C_z(ij)}_{i,j=1,...,N} are parameters which will be updated sequentially by our online algorithm. Before doing so, we will (for numerical convenience) represent the vectorial process by a scalar process with twice the number of observations, i.e. we set

    z_x = [f_x^u ; f_x^v]  and  W_0(x, y) = [ K_0(x^u, y^u)  K_0(x^u, y^v) ; K_0(x^v, y^u)  K_0(x^v, y^v) ]    (4)

and write (ignoring the superscripts)

    <f_x> = Σ_{i=1}^{2N} K_0(x, x_i) α(i)
    cov(f_x, f_y) = K_0(x, y) + Σ_{i,j=1}^{2N} K_0(x, x_i) C(ij) K_0(x_j, y)    (5)

where α = [α_1, ..., α_2N]^T and C = {C(ij)}_{i,j=1,...,2N} are rearrangements of the parameters from eq. (3).
Fig. 1. Illustration of the elements used in the update eq. (6).
The online approximation for GP learning [Csató and Opper 2001] approximates the posterior by a Gaussian at every step. For a new observation s_{t+1}, the previous approximation q_t(z) to the posterior is combined, using Bayes' rule, with a local “likelihood” factor (from eq. (2)),

    p_m(z_{t+1} | s_{t+1}, ω) p(s_{t+1}) / p_G(z_{t+1} | W_{0,t+1}),

into a new posterior. Computing its mean and covariance enables us to create an updated Gaussian approximation q_{t+1}(z) at the next step; q_{N+1}(z), obtained after the last observation, is the final result of the online approximation. This process can be formulated in terms of updates for the parameters α and C which determine the mean and covariance:

    α_{t+1} = α_t + v_{t+1} ∂ ln g(z_{t+1}) / ∂ z_{t+1}
    C_{t+1} = C_t + v_{t+1} [ ∂^2 ln g(z_{t+1}) / ∂ z_{t+1}^2 ] v_{t+1}^T
    with v_{t+1} = C_t K_0^{[t+1]} + I_2^{[t+1]}    (6)

with elements K_0^{[t+1]} and I_2^{[t+1]} as shown in Fig. 1, and

    g(z_{t+1}) = < p_m(z_{t+1} | s_{t+1}, ω) p(s_{t+1}) / p_G(z_{t+1} | W_{0,t+1}) >_{q_t(z_{t+1})}    (7)

where z_{t+1} is a vector, implying vector and matrix quantities in (6). The function g(z_{t+1}) is easy to compute analytically because it only requires the two-dimensional marginal distribution of the process at the observation point s_{t+1}. Fig. 2 shows the results of the online algorithm applied to a sample wind field; details can be found in the figure caption.
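A minimal scalar sketch of the update (6) is given below, under the simplifying assumption of a Gaussian "likelihood" factor so that the derivatives of ln g are analytic; the paper's g(z) of Eq. (7) instead involves the local MDN factor, and all quantities are two-dimensional per spatial location:

```python
import numpy as np

def online_gp_step(alpha, C, xs, kernel, x_new, y_new, noise=0.1):
    """One online update of (alpha, C) for a scalar GP with Gaussian likelihood."""
    k = np.array([kernel(x, x_new) for x in xs])   # k_{t+1}
    kappa = kernel(x_new, x_new)
    mean = k @ alpha                               # current predictive mean
    var = kappa + k @ C @ k                        # current predictive variance
    q1 = (y_new - mean) / (var + noise)            # d ln g / dz
    q2 = -1.0 / (var + noise)                      # d^2 ln g / dz^2
    v = np.append(C @ k, 1.0)                      # v_{t+1} = C_t k plus a unit entry
    alpha = np.append(alpha, 0.0) + q1 * v
    C = np.pad(C, ((0, 1), (0, 1))) + q2 * np.outer(v, v)
    return alpha, C, xs + [x_new]

kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
alpha, C, xs = np.zeros(0), np.zeros((0, 0)), []
for x_obs, y_obs in [(0.0, 1.0), (0.5, 0.8), (1.0, 0.2)]:
    alpha, C, xs = online_gp_step(alpha, C, xs, kernel, x_obs, y_obs)
print(alpha.shape, C.shape)  # one (alpha, C) entry per processed observation
```

Each step grows α and C by one entry, which is exactly the growth the sparse representation of the next subsection is designed to curb.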
3.1 Obtaining Sparsity in Wind Fields
At each time step, the number of non-zero parameters increases in the update equation. This forces us to use a further approximation which reduces the number of supporting examples in the representation eq. (5) to a smaller set of basis vectors. Following our approach in [Csató and Opper 2001], we remove the last data element when a certain score (defined by the feature-space geometry
Fig. 2. The NWP wind field estimation (a), the most frequent online result (b), the second most frequent (symmetric) solution (c), and a bad solution (d). The assessment of good/bad solutions is based on the value of the relative weight from Section 3.2. The grayscale background indicates the model confidence (Bayesian error bars) in the prediction, darker shades meaning more confidence.
associated with the kernel K_0) suggests that the approximation error is small. The remaining parameters are readjusted to partly compensate for the removal as:

    α̂ = α^(t) - Q* q*^(-1) α*
    Q̂ = Q^(t) - Q* q*^(-1) Q*^T    (8)
    Ĉ = C^(t) + Q* q*^(-1) c* q*^(-1) Q*^T - Q* q*^(-1) C*^T - C* q*^(-1) Q*^T

where Q^{-1} = {K_0(x_i, x_j)}_{i,j=1,...,2N}, i.e. Q is the inverse of the Gram matrix; the elements are shown in Fig. 3 (α*, q* and C* are two-by-two matrices).

The presented update is optimal in the sense that the posterior means of the process at the data locations are not affected by the approximation [Csató and Opper 2001].
Fig. 3. Decomposition of model parameters for the update equation (8).
The change of the mean at the location to be deleted is used as a score which measures the loss. This change is (again, very similarly to the results from [Csató and Opper 2001]) measured using the score ε = (q*)^{-1} α* (the parameters of the vector GP can be taken in any order, so we can compute the score for every spatial location). Sequentially removing the data locations with low scores leaves only a small set of so-called basis points upon which all further prediction will depend. Our preliminary results are promising: Fig. 4 shows the resulting wind field when 85% of the spatial knots are removed from the representation eq. (5). On the right-hand side, the evolution of the KL divergence and of the sum-squared error in the means between the vector GP and a trimmed GP using eq. (8) is shown as a function of the number of deleted points. Whilst the approximation of the posterior variance degrades quickly, the predictive mean is fairly robust against deletion.
Measuring the Relative Weight of the Approximation
An exact computation of the posterior would lead to a multimodal distribution of wind ﬁelds at each datapoint. This would correspond to a mixture of GPs as a posterior rather than to a single GP that is used in our approximation. If the individual components of the mixture are well separated, we may expect that our online algorithm will track modes with signiﬁcant underlying probability mass to give a relevant prediction. However, this will depend on the actual sequence of datapoints that are visited by the algorithm. To investigate the variation of our wind ﬁeld prediction with the data sequence, we have generated many random sequences and compared the outcomes based on a simple approximation for the relative mass of the multivariate Gaussian component. ^ ) at a separated Assuming an online solution of the marginal distribution (^ z, Σ mode, we have the posterior at the local maximum expressed: −2N/2
q(^ z ) ∝ γl (2π)
^ −1/2 Σ
(9)
with q(^ z ) from eq. (2), γl the weight of the component of the mixture to which our online algorithm had converged, and we assume the local curvature is also ^. well approximated by Σ
Fig. 4. (a) The predicted wind fields when 85% of the spatial knots have been removed (from Fig. 2). The prediction is based only on the basis vectors (circles); the model confidence is higher at these regions. (b) The difference between the full solution and the approximations, as a function of the number of deleted points, using the squared difference of the means (continuous line) and the KL distance (dashed line), respectively.
Having two different online solutions (ẑ_1, Σ̂_1) and (ẑ_2, Σ̂_2), we find from eq. (9) that the ratio of the two weights is given by

    γ_1 / γ_2 = ( q(ẑ_1) |Σ̂_1|^{1/2} ) / ( q(ẑ_2) |Σ̂_2|^{1/2} )    (10)
This helps us to estimate, up to an additive constant, the “relative weight” of the wind field solutions, and thus to assess the quality of the approximation we arrive at. Results using multiple runs on wind field data confirm this expectation: the correct solution (Fig. 2b) has a large relative weight and appears with high frequency over multiple runs.
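In practice, comparing solutions via Eq. (10) is safest in log space; the sketch below (made-up numbers for two hypothetical modes) compares two online solutions through ln q(ẑ) + ½ ln|Σ̂|:

```python
import numpy as np

def log_relative_weight(log_q_zhat, Sigma_hat):
    """log of q(z_hat) |Sigma_hat|^{1/2}, cf. Eq. (10); avoids underflow."""
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    return log_q_zhat + 0.5 * logdet

w1 = log_relative_weight(-3.0, np.diag([0.5, 0.5]))   # hypothetical mode 1
w2 = log_relative_weight(-6.0, np.diag([0.4, 0.4]))   # hypothetical mode 2
print(w1 > w2)  # mode 1 carries the larger relative weight
```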
4 Discussion
In the wind field example, the online and sparse approximation allows us to tackle much larger wind fields than previously possible. This suggests that we will be able to retrieve wind fields using only scatterometer observations, by utilising all available information in the signal. Proceeding with the removal of basis points, it would be desirable to have an improved update for the vector GP parameters that leads to a better estimation of the posterior kernel (and thus of the Bayesian error bars). At present we obtain different solutions for different orderings of the data. Future work might seek to build an adaptive classifier that works on the family of online solutions, utilising the relative weights.
However, a more desirable method would be to extend our online approach to mixtures of GPs in order to incorporate the multimodality of the posterior process in a principled way.
Acknowledgement This work was supported by EPSRC grant no. GR/M81608.
References
[Bernardo and Smith 1994] Bernardo, J. M. and A. F. Smith (1994). Bayesian Theory. John Wiley & Sons.
[Bishop 1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. New York, N.Y.: Oxford University Press.
[Cressie 1991] Cressie, N. A. (1991). Statistics for Spatial Data. New York: Wiley.
[Csató and Opper 2001] Csató, L. and M. Opper (2001). Sparse representation for Gaussian process models. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), NIPS, Volume 13. The MIT Press. http://www.ncrg.aston.ac.uk/Papers.
[Evans et al. 2000] Evans, D. J., D. Cornford, and I. T. Nabney (2000). Structured neural network modelling of multi-valued functions for wind retrieval from scatterometer measurements. Neurocomputing Letters 30, 23–30.
[Kimeldorf and Wahba 1971] Kimeldorf, G. and G. Wahba (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Applic. 33, 82–95.
[Nabney et al. 2000] Nabney, I. T., D. Cornford, and C. K. I. Williams (2000). Bayesian inference for wind field retrieval. Neurocomputing Letters 30, 3–11.
[Offiler 1994] Offiler, D. (1994). The calibration of ERS-1 satellite scatterometer winds. Journal of Atmospheric and Oceanic Technology 11, 1002–1017.
[Opper 1998] Opper, M. (1998). A Bayesian approach to online learning. See Saad [1998], pp. 363–378.
[Opper and Winther 1999] Opper, M. and O. Winther (1999). Gaussian processes and SVM: Mean field results and leave-one-out estimator. In A. Smola, P. Bartlett, B. Schölkopf, and C. Schuurmans (Eds.), Advances in Large Margin Classifiers, pp. 43–65. Cambridge, MA: The MIT Press.
[Saad 1998] Saad, D. (1998). On-Line Learning in Neural Networks. Cambridge Univ. Press.
[Stoffelen and Anderson 1997a] Stoffelen, A. and D. Anderson (1997a). Ambiguity removal and assimilation of scatterometer data. Quarterly Journal of the Royal Meteorological Society 123, 491–518.
[Stoffelen and Anderson 1997b] Stoffelen, A. and D. Anderson (1997b). Scatterometer data interpretation: Estimation and validation of the transfer function CMOD4. Journal of Geophysical Research 102, 5767–5780.
Fast Training of Support Vector Machines by Extracting Boundary Data

Shigeo Abe and Takuya Inoue

Graduate School of Science and Technology, Kobe University, Rokkodai, Nada, Kobe, Japan
{abe,tinoue}@eedept.kobe-u.ac.jp

Abstract. Support vector machines have gained wide acceptance because of their high generalization ability for real-world applications. Their major drawback, however, is slow training for classification problems with a large number of training data. To overcome this problem, in this paper we discuss extracting boundary data from the training data and training the support vector machine using only these data. Namely, for each training datum we calculate the Mahalanobis distances and extract those data that are misclassified by the Mahalanobis distances or that have small relative differences of the Mahalanobis distances. We demonstrate the effectiveness of the method on benchmark data sets.
1
Introduction
Support vector machines are based on the statistical learning theory developed by Vapnik [1], [2], [3, pp. 47–61]. In support vector machines, an n-class problem is converted into n two-class problems in which one class is separated from the remaining classes. For each two-class problem, the original input space is mapped into a high-dimensional dot-product space called the feature space, and in the feature space the optimal hyperplane that maximizes the generalization ability from the standpoint of the VC dimension is determined. The high generalization ability compared to other methods has been shown for many applications, but the major problem is slow training, especially when the number of training data is large. Therefore, many methods for speeding up training have been proposed [2]. If the support vectors are known in advance, training of support vector machines can be accelerated by using only those data as the training data. Thus, in this paper we calculate the Mahalanobis distances for each datum and estimate, as candidates for the boundary data, the training data that are misclassified by the Mahalanobis distances or that have small relative differences of the Mahalanobis distances. Finally, using two benchmark data sets, we demonstrate the speed-up of training by the proposed method.
2
Architecture of Support Vector Machines
Let m-dimensional inputs x_i (i = 1, . . . , M) belong to Class 1 or 2, and let the associated labels be y_i = 1 for Class 1 and y_i = −1 for Class 2. If these data are linearly separable, we can determine the decision function

    D(x) = w^t x + b,                                            (1)

where w is an m-dimensional vector, b is a scalar, and

    y_i (w^t x_i + b) ≥ 1    for i = 1, . . . , M.               (2)

The hyperplane D(x) = w^t x + b = c for −1 < c < 1 forms a separating hyperplane that separates x_i (i = 1, . . . , M). The distance between the separating hyperplane and the training datum nearest to the hyperplane is called the margin. The hyperplane D(x) = 0 with the maximum margin for −1 < c < 1 is called the optimal separating hyperplane. Now consider determining the optimal separating hyperplane. The Euclidean distance from a training datum x to the separating hyperplane is given by |D(x)|/‖w‖. Thus, assuming the margin δ, all the training data must satisfy

    y_k D(x_k) / ‖w‖ ≥ δ    for k = 1, . . . , M.                (3)

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 308–313, 2001. © Springer-Verlag Berlin Heidelberg 2001
If w is a solution, a w is also a solution, where a is a scalar. Thus we impose the following constraint:

    δ ‖w‖ = 1.                                                   (4)

From (3) and (4), to find the optimal separating hyperplane, we need to find the w with the minimum Euclidean norm that satisfies (2). The data that satisfy the equality in (2) are called support vectors. Now the optimal separating hyperplane can be obtained by minimizing

    (1/2) ‖w‖^2                                                  (5)

with respect to w and b subject to the constraints

    y_i (w^t x_i + b) ≥ 1    for i = 1, . . . , M.               (6)

The number of variables for the convex optimization problem given by (5) and (6) is the number of features plus 1, i.e., m + 1. We convert (5) and (6) into the equivalent dual problem, whose number of variables is the number of training data. First we convert the constrained problem given by (5) and (6) into the unconstrained problem

    Q(w, b, α) = (1/2) w^t w − Σ_{i=1}^{M} α_i {y_i (w^t x_i + b) − 1},    (7)

where α = (α_1, . . . , α_M)^t is the vector of Lagrange multipliers. The optimal solution of (7) is given by the saddle point, where (7) is minimized with respect to w and b and
it is maximized with respect to α_i (≥ 0). Then we obtain the following dual problem: maximize

    Q(α) = Σ_{i=1}^{M} α_i − (1/2) Σ_{i,j=1}^{M} α_i α_j y_i y_j x_i^t x_j    (8)

with respect to the α_i subject to the constraints

    Σ_{i=1}^{M} y_i α_i = 0,    α_i ≥ 0    for i = 1, . . . , M.             (9)
Solving (8) under (9) for the α_i (i = 1, . . . , M), we obtain the support vectors for Classes 1 and 2. Then the optimal hyperplane is placed at equal distances from the support vectors for Classes 1 and 2. To allow data that do not have the maximum margin to exist, we introduce nonnegative slack variables into (2). The resulting optimization problem is similar to the above formulation; the difference is the addition of an upper bound C on the α_i. If the original inputs x are not sufficient to guarantee linear separability of the training data, the obtained classifier may not have high generalization ability, although the hyperplanes are determined optimally. Thus, to enhance linear separability, in support vector machines the original input space is mapped into a high-dimensional dot-product space, called the feature space, using a kernel function that satisfies Mercer's condition. The kernel functions used in this paper are 1) polynomials with degree d: H(x, x′) = (x^t x′ + 1)^d, and 2) radial basis functions: H(x, x′) = exp(−γ ‖x − x′‖^2).
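As a concrete illustration, the two kernels above can be sketched as follows (a minimal NumPy sketch with our own function names, not code from the paper). Mercer's condition implies that the Gram matrix computed on any finite data set is positive semidefinite, which the last lines check numerically:

```python
import numpy as np

def poly_gram(X, d=2):
    # Polynomial kernel H(x, x') = (x^t x' + 1)^d, evaluated on all pairs.
    return (X @ X.T + 1.0) ** d

def rbf_gram(X, gamma=1.0):
    # RBF kernel H(x, x') = exp(-gamma ||x - x'||^2), evaluated on all pairs.
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(dist2, 0.0))

# A Gram matrix built from a Mercer kernel is positive semidefinite.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
for K in (poly_gram(X, d=2), rbf_gram(X, gamma=1.0)):
    assert np.linalg.eigvalsh(K).min() > -1e-9
```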
3
Speeding-up Training by Extracting Boundary Data
According to the architecture of the support vector machine, only the training data that are near the boundaries are necessary. In addition, since the training time becomes longer as the number of training data increases, the training time is shortened if the data that are far from the boundary are deleted. Therefore, if we can delete unnecessary data from the training data efficiently prior to training, we can speed up the training. In the following, we estimate the data that are near the boundaries using the classifier based on the Mahalanobis distance [4], extracting the misclassified data and the data that are near the boundaries.

3.1
Approximation of Boundary Data
The decision boundaries of the classifier using the Mahalanobis distance are expressed by polynomials of degree two in the input variables. Therefore, the boundary data given by this classifier are expected to approximate well the boundary data for the support vector machine, especially when polynomials with degree two are used as kernel functions.
For the class-i data x, the Mahalanobis distance d_i(x) is given by

    d_i^2(x) = (c_i − x)^t Q_i^{−1} (c_i − x),                   (10)

where c_i and Q_i are the center vector and the covariance matrix of the data belonging to class i, respectively:

    c_i = (1/|X_i|) Σ_{x∈X_i} x,                                 (11)

    Q_i = (1/|X_i|) Σ_{x∈X_i} (x − c_i)(x − c_i)^t.              (12)
Here, X_i denotes the set of data belonging to class i and |X_i| is the number of data in the set. A datum x is classified into the class with the minimum Mahalanobis distance. The most important feature of the Mahalanobis distance is that it is invariant under linear transformations of the input variables. Therefore, we need not worry about the scaling of each input variable. For a datum belonging to class i, we check whether

    r(x) = min_{j≠i, j=1,...,n} (d_j(x) − d_i(x)) / d_i(x) ≤ η    (13)

is satisfied, where r(x) is the relative difference of distances and η (> 0) controls the nearness to the boundary. If r(x) is negative, the datum is misclassified; we assume that misclassified data are near the decision boundary. When the datum is correctly classified, inequality (13) is satisfied if the second minimum Mahalanobis distance is shorter than or equal to (1 + η) d_i(x). In extracting boundary data, we set an appropriate value for η, and for each class we select a number of boundary data that is at least the prespecified minimum number N_min and at most the maximum number N_max. The minimum number is set so that the number of boundary data is not too small for classes in which the data that satisfy (13) are scarce; the maximum number is set so that not too many data are selected. The general procedure for extracting boundary data is as follows.

1. Calculate the centers and covariance matrices for all the classes using (11) and (12).
2. For each training datum x belonging to class i, calculate r(x) and put the datum into the stack S_i for class i, whose elements are sorted in increasing order of r(x) and whose maximum length is N_max. Iterate this over all the training data.
3. If the stack S_i includes more than N_min data that satisfy (13), select these data as the boundary data for class i. Otherwise, select the first N_min data as the boundary data.
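The three steps above can be sketched as follows (a minimal NumPy sketch under our own naming; the software used in the paper is not reproduced here). A pseudo-inverse is used so that a singular covariance matrix does not stop the procedure:

```python
import numpy as np

def extract_boundary_data(X, y, eta=1.0, n_min=2, n_max=10):
    classes = np.unique(y)
    # Step 1: centers c_i, eq. (11), and covariances Q_i, eq. (12), per class.
    stats = {}
    for k in classes:
        Xk = X[y == k]
        c = Xk.mean(axis=0)
        Q = (Xk - c).T @ (Xk - c) / len(Xk)          # divisor |X_i|, eq. (12)
        stats[k] = (c, np.linalg.pinv(Q))
    def d(x, k):                                     # Mahalanobis distance, eq. (10)
        c, Q_inv = stats[k]
        return np.sqrt(max((x - c) @ Q_inv @ (x - c), 0.0))
    boundary = {}
    for k in classes:
        Xk = X[y == k]
        # Step 2: relative difference r(x) of eq. (13), sorted increasingly,
        # keeping at most N_max candidates per class (the stack S_i).
        r = np.array([(min(d(x, j) for j in classes if j != k) - d(x, k))
                      / max(d(x, k), 1e-12) for x in Xk])
        order = np.argsort(r)[:n_max]
        # Step 3: keep the data with r(x) <= eta, but at least N_min of them.
        n_near = int(np.sum(r[order] <= eta))
        keep = order[:max(n_near, min(n_min, len(order)))]
        boundary[k] = Xk[keep]
    return boundary
```

For well-separated classes most data have a large r(x), so only the N_min candidates closest to the boundary survive for those classes.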
3.2

Performance Evaluation
Although the performance varies as the kernel varies, the polynomial kernels with degree two performed relatively well. Thus, in the following, unless otherwise stated, we use polynomials with degree two as the kernel functions in evaluating the iris data [5] and the blood cell data [6]. We ran the software developed by Royal Holloway, University of London [7] on a Sun UltraSPARC-IIi (335 MHz) workstation. The software uses pairwise classification [8] to resolve the unclassified regions that arise from the original two-class formulation.

Iris Data. Since the number of iris data is small, we checked only the lowest rankings, in the relative difference of the Mahalanobis distances, of the support vectors for the pairs of classes. Table 1 lists the results when the boundary data were extracted for each class. The numeral in the i-th row and the j-th column shows the lowest ranking of the support vectors, belonging to class i, that separate class i from class j. The diagonal elements show the number of training data for the associated class. The maximum value among the lowest rankings was 8, which is smaller than half the number of class data. Thus, the relative difference of the Mahalanobis distances reflected the boundary data well.

Table 1. The lowest rankings of support vectors for the iris data

    Class    1      2      3
    1       (25)    1      2
    2        8     (25)    3
    3        2      3     (25)
Blood Cell Data. We set N_max to half the maximum number of class data, namely 200, set N_min = 50, and evaluated the performance while changing η. Table 2 lists the results for the blood cell data. When η ≥ 1, sufficiently good recognition rates were obtained for the test data, and training was speeded up two to three times. (The numerals in parentheses in the "Rates" column show the recognition rates for the training data.) Table 3 lists the speed-up of training for different kernels when η = 2.0. For each kernel, the upper row shows the results using all the training data and the lower row the results using the extracted boundary data. For the different kernels, training was speeded up about two times and the recognition rates for the test data were almost the same.
4
Conclusions
We discussed fast training of support vector machines by extracting boundary data that are determined by the relative diﬀerences of the Mahalanobis distances.
Table 2. Performance for the blood cell data

    η     Data   Rates (%)       Time (s)   Speedup
    0.5   1136   90.81 (97.45)    96 (2)     9.4
    1.0   1693   92.06 (99.61)   266 (2)     3.4
    1.5   1978   92.10 (99.29)   390 (2)     2.4
    2.0   2102   92.13 (99.29)   448 (2)     2.1
    —     3097   92.13 (99.32)   924         1
Table 3. Performance for the blood cell data for different kernels (η = 2.0)

    Kernel       Parameter   Rates (%)       Time (s)   Speedup
    Polynomial   d = 3       91.94 (99.94)    937        1
                             92.00 (99.81)    461        2.0
                 d = 4       92.10 (100)      948        1
                             92.10 (99.90)    471        2.0
    RBF          γ = 1       92.13 (100)     2736        1
                             92.13 (99.97)   1331        2.1
                 γ = 0.1     92.16 (100)     2799        1
                             92.13 (99.97)   1387        2.0
The computer simulations using the iris and blood cell data showed that the proposed method efficiently extracts the boundary data and that training was speeded up about two times for the blood cell data.
References

1. V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
2. B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods: Support Vector Learning. The MIT Press, 1999.
3. S. Abe. Pattern Classification: Neuro-fuzzy Methods and Their Comparison. Springer-Verlag, 2001.
4. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
5. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.
6. A. Hashizume, J. Motoike, and R. Yabe. Fully automated blood cell differential system and its application. In Proc. IUPAC Third International Congress on Automation and New Technology in the Clinical Laboratory, pages 297–302, 1988.
7. C. Saunders, M. O. Stitson, J. Weston, L. Bottou, B. Schölkopf, and A. Smola. Support vector machine reference manual. Technical Report CSD-TR-98-03, Royal Holloway, University of London, 1998 (http://svm.cs.rhbnc.ac.uk/).
8. U. H.-G. Kreßel. Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 255–268. The MIT Press, 1999.
Multiclass Classiﬁcation with Pairwise Coupled Neural Networks or Support Vector Machines Eddy Nicolas Mayoraz Human Interface Lab (HIL), Motorola Labs 3145 Porter Drive, Palo Alto, CA 94304, USA
[email protected]

Abstract. Support Vector Machines (SVMs) are traditionally used for multiclass classification by introducing, for each class, one SVM trained to distinguish the associated class from all the others. In a recent experiment, we attempted to solve a K-class problem using a similar decomposition with K feedforward binary neural networks. The disappointing results were explained by the fact that neural networks suffer from datasets with a strongly unbalanced class distribution. In contrast to one-per-class, pairwise coupling introduces one binary classifier for each pair of classes and does not degrade the original class distribution. A few papers report evidence that pairwise coupling gives better results for SVMs than one-per-class. This issue is revisited in this paper, where the one-per-class and pairwise coupling decomposition schemes, used with both SVMs and neural networks, are compared on a real-life problem. Various methods for aggregating the results of the pairwise classifiers are evaluated. Besides our on-line handwriting application, experiments on some databases of the Irvine repository are also reported.
1
Introduction
The context of this work is supervised learning for automatic classification. Formally, the problem consists in finding an approximation F̂ of an unknown labelling function F defined from an input space Ω onto an unordered set of labels {1, . . . , K}, given a training set T = {(x_n, F(x_n))}_{n=1}^{N}. F defines a K-partition of the input space into sets F^{−1}(k) called classes and denoted ω_k. Among the wide variety of methods available to solve such a problem, some, e.g. the perceptron, polynomial classifiers, and the support vector machine (SVM), are restricted to the discrimination between two classes only. Other algorithms can handle more than two classes, but some do not scale up efficiently with the size of the training set or with the number of classes. As many real applications translate into classification problems with a large number of classes and a huge number of data, several techniques have been proposed to decompose a K-class classification problem into a series of smaller 2-class problems. Many studies have demonstrated that even when using an approach which can deal with large-scale problems, an adequate decomposition of the classification problem into subproblems can be favorable to the overall computational complexity as well as to the generalization ability of the global classifier [8,3,19,7].

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 314–321, 2001. © Springer-Verlag Berlin Heidelberg 2001
Several decomposition schemes have been proposed in the literature to transform a K-class partitioning F of Ω into a series f_1, . . . , f_L of L bipartitions or dichotomies f_l : Ω → {−1, +1}. A reconstruction method is coupled with each decomposition scheme for the selection of one of the K classes, given the answers of all dichotomies on a particular input datum. The simplest decomposition schemes are one-per-class (OPC) and pairwise coupling (PWC). A K-class classification problem is decomposed by the former method into K dichotomies, each of which separates one class from all the others. The latter uses K(K − 1)/2 dichotomies, one for each pair of classes; the dichotomy for the pair (ω_k, ω_{k′}) focuses on the discrimination between classes ω_k and ω_{k′}, ignoring all other classes. More elaborate schemes are proposed in [6,10], where redundancy is exploited in the decomposition as a way of increasing the error-correcting capability of the reconstruction. With OPC, the number of dichotomies is small, but each of them involves the whole training data. Moreover, the positive and the negative data are represented in a very unbalanced way in each dichotomy, which can be a problem for some classification methods. In PWC, each dichotomy has a balanced training sample whose size is on average K/2 times smaller than for OPC, but the number of dichotomies is (K − 1)/2 times larger. In terms of computational time, the latter situation is favorable, as most learning methods are superlinear in the training size. Moreover, it has been shown in different studies that if K is large, one can restrict oneself to a small subset of the K(K − 1)/2 dichotomies without losing much accuracy [9]. This work presents a comparison of the OPC and PWC decomposition schemes for both SVMs and feedforward neural networks in a real application, namely on-line Roman handwriting recognition. Different reconstruction schemes for PWC are evaluated. Besides handwriting recognition, the comparison is carried out on benchmark datasets of very different characteristics.
2
Pairwise Coupling Decomposition and Reconstruction
The application of a decomposition/reconstruction scheme can be formally expressed as decomposing the sought approximation F̂ of the unknown function F : Ω → {ω_1, . . . , ω_K} as follows:

    Ω  →(f̂)  R^L  →(σ)  R^L  →(m)  R^K  →(arg max)  {1, . . . , K}        (1)

where f̂_l, l = 1, . . . , L, is the hypothesis yielded by the l-th classifier learning the l-th dichotomy of the decomposition, and m is usually a linear mapping, expressed by a matrix M ∈ R^{L×K}, transforming the outputs of the L classifiers into K values from which a voting rule will pick the final class label. Some binary classifiers yield a Boolean output; others provide a real output that can express either a confidence, a distance, or a probability measure. Whenever the output is continuous, it can eventually be passed through a nonlinear function, either to bound the values into a fixed range, or to get a Boolean value. This is the role of
σ, which is applied componentwise and which is typically the identity function, a sigmoidal function, or the sign function. Some papers discuss decomposition/reconstruction methods in a context where the outputs of the classifiers are interpreted as probabilities, while others deal with positive and negative values. The two are equivalent through an affine transform. Without loss of generality, it will be assumed hereafter that the outputs of f̂_l are centered around 0, that the sign function gives values in {−1, 0, +1}, and that whenever σ is a sigmoidal function, tanh is meant. The decomposition, specifying the task of each dichotomy, can be conveniently described by a matrix D ∈ {−1, 0, +1}^{L×K}, where D_lk = ±1 means ω_k ⊂ f_l^{−1}(±1), and D_lk = 0 indicates that ω_k is ignored by dichotomy l. This decomposition matrix also provides a simple and natural choice for the reconstruction matrix M. For OPC, D has +1s on the diagonal and −1s everywhere else, and M = D produces the same rule as the identity matrix. When no a priori information on the classes is available, this is the only reasonable choice for M. For PWC and σ = sgn, the reconstruction M = D corresponds to the following rule: for each class k, look only at the dichotomies involving ω_k, count how many of these classifiers decide for ω_k, and select the class with the highest score (ties are broken at random). This reconstruction rule is the most widely used in the literature on PWC and will be referred to as hard voting. By opposition, soft voting denotes the case where M = D and σ = tanh (or σ = the identity). A last version, found in [9], is a compromise between hard and soft voting and will be referred to as fair voting: hard voting is applied primarily, and soft voting is used only to break ties. The author reports that for large K ties are rare, so we expect the fair rule to give almost the same results as the hard voting rule.
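For illustration, the OPC and PWC decomposition matrices D can be built as follows (a small sketch with our own function names, not code from the paper):

```python
import numpy as np

def opc_matrix(K):
    # One dichotomy per class: +1 on the diagonal, -1 everywhere else.
    return 2 * np.eye(K, dtype=int) - np.ones((K, K), dtype=int)

def pwc_matrix(K):
    # One dichotomy per pair (k, k'); every other class is ignored (0).
    rows = []
    for k in range(K):
        for kp in range(k + 1, K):
            row = np.zeros(K, dtype=int)
            row[k], row[kp] = +1, -1
            rows.append(row)
    return np.array(rows)

D_opc = opc_matrix(4)   # L = K = 4 dichotomies
D_pwc = pwc_matrix(4)   # L = K(K-1)/2 = 6 dichotomies
assert D_opc.shape == (4, 4) and D_pwc.shape == (6, 4)
```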
In [14], yet another rule is proposed, which is better expressed in probabilistic terms. For a given sample x ∈ Ω, suppose the output p̂_kl of the classifier discriminating between classes ω_k and ω_l lies in [0, 1] and is interpreted as the probability of x ∈ ω_k given that x ∈ ω_k ∪ ω_l. If p̂_k denotes the posterior probability of class ω_k for sample x, then we have, for all k ≠ l, p̂_kl = p̂_k / (p̂_k + p̂_l). Resolving this system of equations for the p̂_k, one derives the following rule:

    F̂(x) = arg max_k p̂_k = arg min_k Σ_{l≠k} 1 / p̂_kl.
This method has been included in the experiments and is referred to as the probabilistic rule.
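Given the pairwise outputs, the hard voting and probabilistic reconstruction rules can be sketched as follows (illustrative code under our own naming; `p_hat[k, l]` stands for the estimate p̂_kl above and is assumed to satisfy p̂_lk = 1 − p̂_kl):

```python
import numpy as np

def hard_voting(f_hat, D):
    # M = D with sigma = sgn: class k collects one vote from each
    # pairwise classifier that decides in its favor.
    votes = np.sign(f_hat) @ D
    return int(np.argmax(votes))

def probabilistic_rule(p_hat):
    # arg max_k p_hat_k  =  arg min_k  sum_{l != k} 1 / p_hat_kl.
    K = p_hat.shape[0]
    scores = [sum(1.0 / p_hat[k, l] for l in range(K) if l != k)
              for k in range(K)]
    return int(np.argmin(scores))

# Three classes, PWC decomposition matrix (rows: pairs (0,1), (0,2), (1,2)).
D = np.array([[1, -1, 0], [1, 0, -1], [0, 1, -1]])
f_hat = np.array([0.7, 0.3, -0.2])   # the first two dichotomies favor class 0
assert hard_voting(f_hat, D) == 0
```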
3
Feedforward Neural Networks and Support Vector Machines
It is a well-known theoretical fact that multilayered feedforward neural networks (MLPs) used for classification approximate a posteriori probabilities when trained to optimize the mean square error or the cross-entropy cost functions [18,2].
However, this assumes ideal conditions such as: the network has enough parameters, it is trained to converge to a global optimum, the training session has infinitely many data, and the a priori probabilities of the test dataset are correctly reflected in the training dataset. In practice, the probabilistic interpretation of neural network outputs has to be made with a lot of care. Considering the general form of a decomposition/reconstruction scheme presented in equation (1), the analogy with the last layer of an MLP used as a K-class classifier is obvious. If in turn the classifiers solving the dichotomies are also MLPs of, say, depth h, the overall structure is an MLP of depth h + 1 with a last hidden layer made of L units having an activation function σ. The most important difference is that the units of the last hidden layer are trained to perform tasks specified ahead of time. So, one could argue that this approach will never perform better than training a single larger MLP. Such an argument is invalidated mostly by two considerations. First, the learning in an MLP is suboptimal, especially in a deep MLP. Second, a large MLP is more likely to overfit the data than a similar architecture with constrained hidden units. The most famous applications of SVMs to multiclass classification have used the OPC decomposition scheme [16]. Some works report experiments with SVMs using PWC [15,9,13,17]. A potential problem occurring when using SVMs in any decomposition scheme is due to the unnormalized output ranges of the individual SVMs [11]. This is a potentially serious problem when the OPC decomposition is used, as the final voting usually operates on the outputs of the SVMs directly (i.e. σ in (1) is the identity function). With PWC, it is essential to use a saturating (nonlinear) function as σ, such as sgn or tanh; otherwise the approach breaks down completely.
4
Empirical Evaluations and Comparisons
This work was primarily carried out in the context of handwriting recognition, using some of Motorola's databases for on-line Roman handwriting. In order to measure and compare these different approaches in various contexts, we also report some numerical experiments on smaller, widely used datasets from the Irvine Machine Learning Repository [12]. 4.1
Databases
From the Irvine repository, we selected databases of small size (given that the other, in-house databases used were large), without missing data, and with different characteristics: glass, segmentation, ecoli, yeast, and vowel. The handwriting databases used in this work represent on-line handwritten isolated Roman characters, from different writers, from different countries. They result from a preprocessing yielding 19 attributes, which were also centered and normalized. Three different benchmarks are considered: dg has 91K data equally distributed among 10 classes (the 10 digits), and AZ, respectively az, contains 200K (resp. 250K) data equally distributed in the 26 classes of the uppercase (resp. lowercase) alphabet. Our prior knowledge of these benchmarks is that dg is much easier than the other two, and the lowercase problem is slightly more complex than the uppercase one.

4.2
Experiments
The experiments on the Irvine databases were done according to a repetition of five 2-fold cross-validations (denoted 5x2cv in [5]). Thus, every number presented below is a mean ± standard deviation over 10 runs. The handwriting databases being large, we did only a single 3-fold cross-validation. Every class was split once into three equal parts, in such a way that the instances of a same writer would all be in the same part (to ensure strict writer independence between training sets and test sets). Thus, all values are averages and standard deviations over 3 runs. The meta-parameters, such as the number of hidden units, the initial learning rate, and the annealing process for the MLP, and the standard deviation of the Gaussian (only RBF kernels were experimented with here) and the weight C of the regularizing term for the SVM, were chosen as follows. In the experiments with OPC, they were selected, independently for each dataset, through a simple trial-and-error process and fixed to the same value for every one of the K binary classifiers. In PWC, the characteristics of each of the K(K − 1)/2 training sets can vary a lot; therefore we tailored these meta-parameters to each binary problem individually. SVMTorch was used in all SVM simulations [4]. The experiments with neural networks were done with our own software implementing vanilla backpropagation. In this work, the focus is placed on accuracy, regardless of the complexity of each system. It is clear that the SVMs constructed are always much more complex than the MLPs, both in terms of memory and in terms of evaluation time. For example, for the handwriting recognition tasks, the best single MLPs found had 200 hidden units, while for the SVMs based on OPC, the total number of support vectors (counting only once the support vectors used in different machines) was typically several thousand. 4.3
Results
Table 1 summarizes the results obtained. The results of a single MLP are reported in the column "single". The three reconstruction methods discussed in Section 2 are reported here. Standard deviations are indicated in small fonts. For the soft voting rule, a tanh was used with four different slopes, and only the best result is reported here. The slope has an important influence on the efficiency of the method. If it is too high, soft voting is identical to fair voting. If it is too low, the performance drops, and in the case of SVMs it can get really poor. This is easily explained by the fact that the outputs of an SVM lie in a wide range (this is even worse with polynomial kernels), and mostly only their comparison towards −1, 0, and +1 is meaningful. Except for segmentation and yeast, SVMs always outperform MLPs, sometimes very significantly (vowel, dg, AZ, and az). Replacing a single MLP
Table 1. Means and standard deviations of percentage of error on test sets.

    Db       single      MLP                                             SVM
                         OPC         PWC fair    PWC soft    PWC prob    OPC         PWC fair    PWC soft    PWC prob
    glass    31.22 ±4.3  33.83 ±4.0  35.89 ±3.3  35.42 ±3.9  35.42 ±3.6  30.75 ±3.1  31.68 ±2.8  31.87 ±2.7  32.06 ±3.0
    segm.     3.23 ±0.2   3.97 ±0.4   5.10 ±0.9   5.09 ±0.9   5.14 ±1.0   4.03 ±0.6   3.68 ±0.3   3.62 ±0.3   3.66 ±0.2
    ecoli    14.81 ±3.1  15.94 ±3.1  17.31 ±3.0  17.54 ±3.0  18.33 ±2.0  12.37 ±3.1  13.62 ±3.3  13.69 ±3.3  13.80 ±3.2
    yeast    39.94 ±1.5  47.00 ±2.0  43.77 ±1.5  43.60 ±1.6  44.06 ±1.4  40.53 ±1.1  41.24 ±1.9  41.22 ±1.9  41.52 ±1.7
    vowel    10.42 ±2.9  14.16 ±2.6  23.13 ±2.8  23.19 ±2.9  22.97 ±2.9   4.42 ±1.6   4.04 ±1.9   3.98 ±1.9   4.08 ±1.9
    dg        1.37 ±0.0   1.32 ±0.0   3.08 ±0.3   3.06 ±0.4   3.10 ±0.4   0.95 ±0.1   0.98 ±0.1   0.95 ±0.1   0.96 ±0.1
    AZ        3.41 ±0.3   3.62 ±0.3   6.22 ±0.4   6.18 ±0.4   6.11 ±0.4   2.44 ±0.1   2.63 ±0.1   2.50 ±0.1   2.89 ±0.1
    az        5.21 ±0.1   6.18 ±0.2   8.15 ±0.1   8.12 ±0.1   8.04 ±0.1   4.23 ±0.1   4.29 ±0.1   4.27 ±0.1   4.33 ±0.1
by many binary ones, either with OPC or with PWC, never helps, except for dg, where a slight improvement was obtained with OPC. This is a surprise, however, as the parameters of the single MLP (number of hidden units, initial learning rate, annealing scheme) had been tailored with great care for this problem in particular. For both MLPs and SVMs, OPC often outperforms PWC. This was a surprise to us, especially given that the meta-parameters were adjusted independently for each binary classifier involved in PWC, and that we fine-tuned the reconstruction method as well. If this could be justified by too-small datasets in the case of the Irvine databases, this argument does not hold in the case of handwriting recognition. For dg there is no significant difference, but in the case of AZ and az, the best results obtained with SVM and PWC are worse than with OPC, with confidence 0.95, according to the 5x2cv F test [1]. This contradicts previous works showing that PWC always outperforms OPC [9,17]. After the first submission of the present paper, as an attempt to understand this disagreement between our results and the cited works, we carried out new experiments replacing the RBF kernels by polynomial kernels, which were used in [9,17]. As a matter of fact, this change degraded the results with one-per-class far below the results with pairwise coupling, which stayed almost unchanged. There is a fundamental difference in nature between these two types of kernels, which makes them appropriate or not for the one-per-class or pairwise coupling decomposition schemes. This issue requires further investigation, but a simple explanation of the fact that pairwise coupling does not improve on one-per-class when RBF kernels are used could be that for inputs with no similarity to the ones seen during training (often the case with pairwise coupling), the SVM output is close to zero (and does not buy us anything) with RBF kernels, but has a large magnitude with polynomial kernels.
5
Conclusions and Further Research
In this paper, we compare two simple ways of decomposing a multiclass classification problem into several 2-class problems. The potential of such approaches to enhance the classification accuracy of multilayered neural networks is evaluated.
This is motivated by the excellent results obtained by SVMs, which, in order to handle multiclass problems, are used through the same decomposition principles. Among the advantages of training many classifiers on simpler tasks, we can mention the possibility of validating each subsystem individually and fine-tuning their training parameters. The conclusions are threefold. First, it is not easy to improve the performance of a single neural network by using one binary neural network for each dichotomy of a decomposition scheme. Even though we spent a lot of effort adjusting the training of each individual binary net, retraining the ones whose performances were not completely satisfactory, we succeeded only once out of 8 benchmarks in achieving significantly better results than a single network. Second, the common belief that, when using SVMs, pairwise coupling always outperforms one-per-class has been revisited. While this seems to be true when polynomial kernels are used, it does not hold at all for RBF kernels, which give scores slightly in favor of one-per-class over pairwise coupling. Finally, SVMs always outperform neural networks. However, in our experiments the size of the models was deliberately not an issue, and each model was dimensioned for its best recognition performance. The minor improvement in recognition rates given by SVMs over neural networks often costs one or two orders of magnitude in size. It would be interesting to see how the results of SVMs degrade when we progressively suppress the support vectors with the lowest coefficients.
Acknowledgements. The author is thankful to Vincent Wan for initiating this research during his visit at Motorola. This work was made possible by the use of SVMTorch, which can efficiently handle quite large datasets. The author is also thankful to his colleague Giovanni Seni for providing an efficient implementation of backpropagation for feedforward neural networks.
References
1. Ethem Alpaydin. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11(8):1885–1892, 1999.
2. Hervé A. Bourlard and Nelson Morgan. Links between Markov models and multilayer perceptrons. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 1 (NIPS*88), volume 1, pages 502–510, San Mateo, CA, 1989. Morgan Kaufmann.
3. Pierre J. Castellano, Stefan Slomka, and Sridha Sridharan. Telephone based speaker recognition using multiple binary classifiers and Gaussian mixture models. In ICASSP, volume 2, pages 1075–1078. IEEE Computer Society Press, 1997.
4. Ronan Collobert and Samy Bengio. Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160, 2001. http://www.idiap.ch/learning/SVMTorch.html.
5. T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
6. T. G. Dietterich and G. Bakiri. Error-correcting output codes: A general method for improving multiclass inductive learning programs. In Proceedings of AAAI-91, pages 572–577. AAAI Press / MIT Press, 1991.
7. Dominique Genoud, Miguel Moreira, and Eddy Mayoraz. Text dependent speaker verification using binary classifiers. IDIAP-RR 8, IDIAP, 1997. To appear in the Proceedings of the International Conference on Automatic Speech and Signal Processing, ICASSP'98.
8. Stefan Knerr, Leon Personnaz, and Gerard Dreyfus. Handwritten digit recognition by neural networks with single-layer training. IEEE Trans. on Neural Networks, 3(6):962–968, November 1992.
9. Ulrich Kreßel. Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning. MIT Press, 1999.
10. Eddy Mayoraz and Miguel Moreira. On the decomposition of polychotomies into dichotomies. In Douglas H. Fisher, editor, The Fourteenth International Conference on Machine Learning, pages 219–226, 1997.
11. Eddy Mayoraz and Ethem Alpaydin. Support vector machines for multiclass classification. In Proceedings of the International Workshop on Artificial Neural Networks (IWANN'99), 1998. ftp://ftp.idiap.ch/pub/reports/1998/rr98-06.ps.gz.
12. C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. Technical report, Irvine, CA: University of California, Department of Information and Computer Science, 1998. Machine-readable data repository http://www.ics.uci.edu/~mlearn/MLRepository.html.
13. J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12 (NIPS*1999). MIT Press, 2000.
14. David Price, Stefan Knerr, Leon Personnaz, and Gerard Dreyfus. Pairwise neural network classifiers with probabilistic outputs. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7 (NIPS*94), volume 7, pages 1109–1116. The MIT Press, 1995.
15. M. Schmidt and H. Gish. Speaker identification via support vector classifiers. In Proceedings of ICASSP'96, pages 105–108, Atlanta, GA, 1996.
16. B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 252–257. AAAI Press, 1995.
17. Fusi Wang, Louis Vuurpijl, and Lambert Schomaker. Support vector machines for the classification of western handwritten capitals. In Seventh International Workshop on Frontiers in Handwriting Recognition Proceedings, pages 167–176, 2000.
18. H. White. Learning in artificial neural networks: A statistical perspective. Neural Computation, 1(4):425–464, 1989.
19. Stephen A. Zahorian, Peter Silsbee, and Xihong Wang. Phone classification with segmental features and a binary-pair partitioned neural network classifier. In ICASSP, volume 2, pages 1011–1014. IEEE Computer Society Press, 1997.
Incremental Support Vector Machine Learning: A Local Approach

Liva Ralaivola and Florence d'Alché-Buc
Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie, 8, rue du Capitaine Scott, F-75015 Paris, France
{liva.ralaivola,florence.dalche}@lip6.fr
Abstract. In this paper, we propose and study a new online algorithm for learning an SVM based on the Radial Basis Function kernel: Local Incremental Learning of SVM, or LISVM. Our method exploits the "locality" of RBF kernels to update the current machine by considering only a subset of support candidates in the neighbourhood of the input. The determination of this subset is conditioned on the computation of the variation of the error estimate. Our implementation is based on SMO, introduced and developed by Platt [13]. We study the behaviour of the algorithm during learning when using different generalization error estimates. Experiments on three data sets (batch problems transformed into online ones) have been conducted and analyzed.
1 Introduction
The emergence of smart portable systems and the daily growth of databases on the Web have revived the old problem of incremental and online learning. Meanwhile, advances in statistical learning have established Support Vector Machines (SVM) as one of the most powerful families of learners (see [6,16]). Their specificity lies in three characteristics: SVMs maximize a soft margin criterion, the major parameters of an SVM (the support vectors) are taken from the training sample, and nonlinear SVMs rely on kernels to deal with a high dimensional feature space without working directly in it. However, few works tackle the issue of incremental learning of SVMs. One of the main reasons lies in the nature of the optimization problem posed by SVM learning. Although some very recent works propose ways to update the SVM each time new data are available [4,11,15], they generally imply relearning the whole machine. The work presented here starts from another motivation: since the principal parameters of an SVM are the training points themselves, and as long as a local kernel such as a Gaussian kernel is used, it is possible to focus learning on a neighbourhood of the new data and update only the weights of the training data concerned. In this paper, we briefly present the key idea of SVMs and then introduce the incremental learning problem. The state of the art is briefly presented and discussed. Then, we present the local incremental algorithm, or LISVM, and discuss the
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 322–330, 2001.
© Springer-Verlag Berlin Heidelberg 2001
model selection method to determine the size of the neighbourhood to be used at each step. Numerical simulations on IDA benchmark datasets [14] are presented and analyzed.
2 Support Vector Machines
Given a training set S = {(x_i, y_i)}_{i=1}^{\ell}, support vector learning [6] tries to find a hyperplane with minimal norm that separates the data mapped into a feature space Ψ via a nonlinear map Φ : ℝ^n → Ψ, where n denotes the dimension of the vectors x_i. To construct such a hyperplane, one must solve the following quadratic problem [3]:

\max_{\alpha} \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j k(x_i, x_j)    (1)

subject to \sum_{i=1}^{\ell} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C    (2)
with the kernel function k defined as k(x, x') = Φ(x) · Φ(x'). The solution of this problem provides w = \sum_{i=1}^{\ell} \alpha_i y_i \Phi(x_i), a real value b obtained with the help of the optimality conditions, and the decision rule f(x) = sgn(w · Φ(x) + b). The most efficient practical algorithms for solving this problem implement a strategy of iterative subset selection [9,12,13], where only the points in those subsets may see their corresponding Lagrange multipliers change. This process of optimizing the global quadratic problem only on a subset of the whole set of variables is a key point of our algorithm, since we use such an optimization scheme on subsets defined as neighbourhoods of new incoming data.
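As an illustration of the resulting decision rule f(x) = sgn(w · Φ(x) + b) with an RBF kernel, the following sketch (function names are our own, not from the paper) evaluates f without ever forming w explicitly, using only the support vectors and their coefficients:

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def svm_decision(x, support, alphas, labels, b, sigma=1.0):
    """f(x) = sgn(sum_i alpha_i y_i k(x_i, x) + b)."""
    s = sum(a * y * rbf_kernel(xi, x, sigma)
            for a, y, xi in zip(alphas, labels, support)) + b
    return 1 if s >= 0 else -1

# toy machine: two support vectors of opposite class
print(svm_decision([0.1], [[0.0], [2.0]], [1.0, 1.0], [1, -1], 0.0))  # 1
```

The locality visible here (each support vector's term decays exponentially with distance) is precisely what the incremental scheme below exploits.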
3 Incremental Learning and SVM
In incremental learning, the training dataset is not fully available at the beginning of the learning process, as it is in batch learning. Data can arrive at any time, and the hypothesis has to be updated if necessary to capture the class concept. In the following, we suppose that data are drawn from a fixed but unknown distribution. We view this task as a first step towards learning drifting concepts, i.e. learning to classify data drawn from a distribution that changes over time. More formally, the problem considered here may be stated as follows. Let H be a hypothesis family (such as the family of Gaussian kernel-based SVMs). Let F be a fixed but unknown distribution over (x, y) pairs, x ∈ X and y ∈ {−1, 1}. We suppose that at time t, a training pattern is randomly sampled from F. Let S_t denote the sample observed at time t. The goal of incremental online learning is to find and update a hypothesis h_t ∈ H using the available examples in order to minimize the generalization error.
The Kernel-Adatron algorithm [8] is a very fast approach to approximating the solution of support vector learning and can be seen as a component-wise optimization algorithm. It has been successfully used by its authors to dynamically adapt the kernel parameters of the machine, performing model selection during the learning stage. Nevertheless, a drawback of this work is that it may not be straightforward to extend it to deal with drifting concepts. Another approach, proposed in [15], consists in learning new data while discarding all past examples except the support vectors. The proposed framework thus relies on the property that support vectors summarize the data well, and it has been tested on some standard machine learning datasets to evaluate goodness criteria such as stability, improvement, and recoverability. Finally, a very recent work [4] proposes a way to incrementally solve the global optimization problem in order to find the exact solution. Its reversible aspect allows "decremental" unlearning and efficient computation of leave-one-out estimates.
4 Local Incremental Learning of a Support Vector Machine
We first consider the SVM as a voting machine that combines the outputs of experts, each of which is associated with a support vector in the input space. When an RBF kernel, or any kernel based on a notion of neighbourhood, is used, the influence of a support vector is essentially confined to a limited area. Then, in the framework of online learning, when a new example becomes available it should not be necessary to reconsider all the current experts, but only those concerned by the localization of the input. To some extent, the proposed algorithm is related to the work of Bottou and Vapnik [1] on local learning algorithms.

4.1 Algorithm
We sketch the updating procedure to build h_t from h_{t−1} when the incoming data (x_t, y_t) is to be learned, S_{t−1} being the set of instances learned so far, and S_t = S_{t−1} ∪ {(x_t, y_t)}:
1. Initialize the Lagrange multiplier α_t to zero.
2. If y_t f^{t−1}(x_t) > 1 (the point is well classified), terminate (take h_{t−1} as the new hypothesis h_t).
3. Build a working subset of size 2 with x_t and its nearest example in input space.
4. Learn a candidate hypothesis g by optimizing the quadratic problem on the examples in the working subset.
5. If the generalization error estimate of g is above a given threshold δ, increase the working subset by adding the next closest point to x_t not yet in the current subset and return to step 4.
6. Set h_t to g.
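The control flow of steps 1–6 can be sketched as follows. This is a hypothetical skeleton, not the authors' SMO-based implementation: `decision`, `retrain`, and `err_estimate` are placeholders for the previous hypothesis f^{t−1}, the partial re-optimization on the working subset, and the generalization error estimate, respectively; `neighbours` lists the candidate points sorted by distance to x_t.

```python
def lisvm_update(x_t, y_t, decision, neighbours, retrain, err_estimate, delta):
    """Sketch of the LISVM update loop (steps 1-6 above)."""
    if y_t * decision(x_t) > 1:           # step 2: already well classified
        return None                        # keep h_{t-1}
    working = [x_t, neighbours[0]]         # step 3: working subset of size 2
    candidate = retrain(working)           # step 4: optimize on the subset
    for nb in neighbours[1:]:              # step 5: grow the neighbourhood
        if err_estimate(candidate) <= delta:
            break
        working.append(nb)
        candidate = retrain(working)
    return candidate                       # step 6

# stub example: "retrain" returns the working-set size, whose mock
# error estimate 1/size decreases as the neighbourhood grows
g = lisvm_update(0.0, 1, lambda x: 0.0, [1, 2, 3, 4],
                 retrain=len, err_estimate=lambda c: 1.0 / c, delta=0.3)
print(g)  # 4: stopped once the estimate 1/4 fell below delta = 0.3
```

The stub only exercises the stopping logic; in the real algorithm, `retrain` solves the quadratic problem restricted to the working subset.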
Fig. 1. Left: three neighbourhoods around the new data and the interesting small band of points around the decision surface parameterized by ε. Right: new decision surface and example of point whose status changed
We stop constructing growing neighbourhoods (see Fig. 1) around the new data as soon as the generalization error estimate falls under a given threshold δ. A compromise is thus performed between time complexity and the value of the generalization error estimate, as we consider that this estimate is minimal when all data are relearned. The key point of our algorithm thus lies in finding the size of the neighbourhood (i.e. the number K of neighbours to be considered) and hence in finding a well suited generalization error estimate, which is the focus of the next section. To increase computational speed, and to implement our idea of locality, we only consider, as shown in Fig. 1, a small band around the decision surface in which points may be interesting to reconsider. This band is defined by a real ε: points of class y_t for which y_t f(x) ≤ 1, and points of class −y_t for which y_t f(x) ≤ 1 + ε, are in the band.

4.2 Model Selection and the Neighbourhood Determination
The local algorithm requires choosing the value of K, the neighbourhood size. There are several ways to do this. A first simple solution is to fix K a priori, before the beginning of the learning process. However, the best K at time t is obviously not the best one at time t + t′, so it may be difficult to choose a single K suitable for all points. A second, more interesting solution is therefore to determine it automatically through the minimization of a cost criterion. The idea is to apply some process of model selection to the different hypotheses h_k^t that can be built. The way we choose to select models consists in comparing them according to an estimate of their generalization error. One way to do this is to evaluate the error on a test set and keep the K for which E_Test is smallest. In real problems, however, it is not realistic to assume that a test set is available during the incremental learning process, so this solution cannot be considered a good answer to our problem. Alternatively, there exist analytical expressions for leave-one-out estimates of the generalization error of SVMs, such as those recalled in [5]. However, in order to use these estimates, one has to ensure that the margin optimization problem has
been solved exactly. The same holds for Joachims' ξα-estimators [10,11]. This restriction prevents us from using these estimates, as we only perform a partial, local optimization. To circumvent the problem, we propose to use the bound on generalization provided by a result of Cristianini and Shawe-Taylor [7] for thresholded linear real-valued functions. While the bound it gives is loose, it allows one to "qualitatively" compare the behaviours of functions of the same family. The theorem states as follows:

Theorem 1. Consider thresholding real-valued linear functions L with unit weight vectors on an inner product space X and fix γ ∈ ℝ⁺. There is a constant c such that, for any probability distribution D on X × {−1, 1} with support in a ball of radius R around the origin, with probability 1 − η over ℓ random examples S, any hypothesis f ∈ L has error no more than

\epsilon = \text{err}_D(f) \le B = \frac{c}{\ell} \left( \frac{R^2 + \|\xi\|^2}{\gamma^2} \log^2 \ell + \log \frac{1}{\eta} \right)    (3)

where ξ = ξ(γ, h, S) = (ξ_1, ξ_2, …, ξ_ℓ) is the margin slack vector with respect to f and γ, defined as ξ_i = max(0, γ − y_i f(x_i)).

We note that once the kernel parameter σ is fixed, this theorem, applied directly in the feature space Ψ_σ defined by the kernel k_σ, provides an estimate of the generalization error for the machines we work on. This estimate is expressed in terms of a margin value γ, the norm of the margin slack vector ξ, and the radius of the ball containing the data. In order to use this theorem, we consider the feature space of dimension d(K) defined by the Gaussian kernel with a fixed value of σ. In this space, we consider L with unit weight vectors. At step t, different functions h_k^t can be learnt, with k = 1, …, K_max. For each k, we get a function f_k^t by normalizing the weight vector of h_k^t; f_k^t belongs to L and, when thresholded, provides the same outputs as h_k^t does. The theorem can then be applied to f = f_k^t and the data of S_t. It ensures that:

\epsilon \le B(c(\gamma, L), R_f, \xi_f, \gamma, t).    (4)

Hence, for each k = 1, …, K_max, we can use this bound as a test error estimate. However, as R_f is the radius of the ball containing the examples in the feature space [17], it only depends on the chosen kernel and not on k. On the contrary, ξ_f, defined as ξ_{i,k}^t = max(0, γ − y_i f_k^t(x_i)), is the unique quantity which differs among the functions f_k. The slack vectors ξ_k^t are thus sufficient to compare the functions f_k, justifying our choice to use them as a model selection criterion. Looking at the bound, we see that a value of γ must be chosen: to do so, we take a time-varying value defined as γ_t = 1/‖w_{t−1}‖.
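The model selection criterion thus reduces to computing the slack vector and its squared norm for each candidate f_k; a minimal sketch (our own illustration, with hypothetical names):

```python
def slack_vector(gamma, labels, outputs):
    """xi_i = max(0, gamma - y_i f(x_i)) for a normalized decision f."""
    return [max(0.0, gamma - y * f) for y, f in zip(labels, outputs)]

def slack_norm_sq(gamma, labels, outputs):
    """||xi||^2, the quantity compared across the candidates f_k."""
    return sum(x * x for x in slack_vector(gamma, labels, outputs))

# one well-classified point, one margin error, one inside the margin
print(slack_norm_sq(1.0, [1, -1, 1], [2.0, -0.5, 0.3]))  # ≈ 0.74
```

The candidate with the smallest ‖ξ‖² would be retained, since R and γ are shared by all candidates at a given step.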
5 Experimental Results
Experiments were conducted on three different binary classification problems: Banana [14], Ringnorm [2], and Diabetes. The datasets are available at www.first.gmd.de/~raetsch/.

[Figure 2: four panels (a)–(d) for the Banana problem, plotting validation error, number of support vectors, ξ²/ℓ, and ‖w‖² against the number of training instances observed, for Rebatch, LISVM (δ = 0.1, 0.01, 0.001), and Best Test.]

Fig. 2. Evolution of the machine parameters for the Banana problem during online learning with a potential band (see Fig. 1) defined by ε = 0.5

For each problem, we tested LISVM for different values of the threshold δ. The main points we want to assess are the classification accuracy our algorithm is able to achieve, the appropriateness of the proposed criterion for selecting the "best" neighbourhood, and the relevance of the local approach. We simulated online incremental learning by providing the classifier with one example at a time, taken from a given training set. After each presentation, the current hypothesis is updated and evaluated on a validation set of size two thirds the size of the corresponding test set. Only the "best test" algorithm uses the remaining third of this latter set to perform neighbourhood selection. Another incremental learning process, called "rebatch", is also evaluated: it consists in running a classical SVM learning procedure over the whole dataset whenever a new instance becomes available. Experiments are run on 10 samples in order to compute means and standard deviations. Quantities of interest are plotted in Fig. 2 and Fig. 3. Table 1 reports the results at the end of the training process. We ran experiments over the same range of δ values in order to show that it is not difficult to fix a threshold that yields good generalization performance. However, we must be aware that δ reflects our expectation of the error made by the current hypothesis. Hence, a smaller threshold δ should be preferred when the data are assumed to be easily separable (e.g. Ringnorm), while larger values should be chosen when the data are supposed to be harder to discriminate. This remark is confirmed by the observation of the Banana and Diabetes results. While
Table 1. Results on the three datasets. The potential band used (see Fig. 1) is defined by ε = 0.5. Classical SVM parameters are given with each dataset.

Banana (C = 100, σ = 1)
                    validation error   Nb of SVs     ‖w‖²         ξ(1,h,S)²/ℓ
Batch/Rebatch       0.107 ± 0.004      91.2 ± 9.8    877 ± 166    0.281 ± 0.036
Best test           0.112 ± 0.013      52.7 ± 21.5   732 ± 232    0.801 ± 0.809
LISVM δ = 0.001     0.113 ± 0.008      84.1 ± 18.8   1220 ± 291   0.328 ± 0.073
LISVM δ = 0.01      0.113 ± 0.005      89.9 ± 9.4    1270 ± 244   0.309 ± 0.049
LISVM δ = 0.1       0.111 ± 0.006      92.1 ± 7.9    1110 ± 220   0.305 ± 0.04

Ringnorm (C = 1e9, σ = 10)
                    validation error   Nb of SVs     ‖w‖²         ξ(1,h,S)²/ℓ
Batch/Rebatch       0.0263 ± 0.004     80.4 ± 8.5    3290 ± 701   4.5e6 ± 4.5e6
LISVM δ = 0.001     0.0272 ± 0.006     80.4 ± 8.5    3560 ± 809   0.007 ± 0.002
LISVM δ = 0.01      0.0324 ± 0.008     65.9 ± 4      2180 ± 559   0.046 ± 0.012
LISVM δ = 0.1       0.0628 ± 0.013     49.7 ± 5.8    1140 ± 500   0.155 ± 0.035

Diabetes (C = 10, σ = 20)
                    validation error   Nb of SVs     ‖w‖²         ξ(1,h,S)²/ℓ
Batch/Rebatch       0.228 ± 0.022      275 ± 7.1     382 ± 30.2   0.694 ± 0.017
LISVM δ = 0.001     0.226 ± 0.023      197 ± 27.9    362 ± 70.2   0.646 ± 0.025
LISVM δ = 0.01      0.229 ± 0.021      227 ± 36      382 ± 71.8   0.652 ± 0.031
LISVM δ = 0.1       0.223 ± 0.023      226 ± 39.2    386 ± 66.4   0.652 ± 0.028
Fig. 3. Evolution of validation error for Ringnorm (left) and Diabetes (right)
the chosen values of δ lead to equivalent (or greater) complexity than the batch algorithm for equivalent validation error, further experiments with δ = 0.5 show that LISVM obtains a lower complexity (55 SVs and ‖w‖² = 677 for Banana, 179 SVs and ‖w‖² = 927 for Diabetes) but with degraded performance on the validation set (rates of 0.16 and 0.24, respectively). For Ringnorm, this observation can also be made within the chosen range of values [0.001; 0.01; 0.1]. Relaxing the value of δ leads to a lower complexity at the cost of a higher validation error. These experiments confirm that the relevant range of δ corresponds to a balance between a low validation error and a small number of neighbours needed to reach the threshold at each step. CPU time measurements provide a direct means of evaluating the δ values for which the local approach is attractive. In the Ringnorm task, for instance, the CPU time is 12.3 s for a large neighbourhood (δ = 0.001), while it
is reduced to 4.0 s and 1.9 s for the smaller δ values of 0.01 and 0.1, respectively, with lower complexity. Several curves reflecting the behaviour of the algorithm over time were drawn for the Banana problem. The same curves were measured for the other problems but are omitted for the sake of space. The validation error curves show the convergence of the algorithm. This behaviour is confirmed in Fig. 2(c), where all the incremental algorithms exhibit a stabilizing value of ξ²/ℓ. For LISVM and the "rebatch" algorithm, the number of support vectors increases linearly with the number of new observed instances. This is not an issue, considering that the squared norm of the weight vector increases much more slowly, suggesting that if the number of training instances had been larger, a stabilization would have been observed. Finally, let us compare the behaviour of the "best test" algorithm to LISVM on the Banana problem. This algorithm performs the SVM selection by choosing the size of the neighbourhood that minimizes the test error and is thus very demanding in terms of CPU time. Nevertheless, it is remarkable that it reaches the same validation performance with half as many support vectors and a restricted norm of the weight vector, illustrating the relevance of the local approach.
6 Conclusion
In this paper, we propose a new incremental learning algorithm designed for RBF kernel-based SVMs. It exploits the locality of the RBF kernel by relearning only the weights of training data that lie in the neighbourhood of the new data. Our model selection scheme is based on a criterion derived from a generalization error bound from [7] and allows determining a relevant neighbourhood size at each learning step. Experimental results on three data sets are very promising and open the door to real applications. The reduction in CPU time provided by the local approach should be especially important when numerous training instances are available. Further work concerns tests on large scale incremental learning tasks such as text categorization. The possibility of making the δ parameter adaptive will also be studied. Moreover, LISVM will be extended to the context of drifting concepts by the use of a temporal window.
References
1. L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888–900, 1992.
2. L. Breiman. Bias, variance and arcing classifiers. Technical Report 460, University of California, Berkeley, CA, USA, 1996.
3. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955–974, 1998.
4. G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.
5. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters for support vector machines. Technical report, AT&T Labs, March 2000.
6. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.
7. Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods, chapter 4, Generalisation Theory, page 68. Cambridge University Press, 2000.
8. T. Friess, N. Cristianini, and C. Campbell. The kernel-adatron algorithm: a fast and simple learning procedure for support vector machines. In J. Shavlik, editor, Machine Learning: Proc. of the 15th Int. Conf. Morgan Kaufmann Publishers, 1998.
9. T. Joachims. Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184. MIT Press, Cambridge, MA, 1998.
10. T. Joachims. Estimating the generalization performance of a SVM efficiently. In Proc. of the 17th Int. Conf. on Machine Learning. Morgan Kaufmann, 2000.
11. R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In Proc. of the 17th Int. Conf. on Machine Learning. Morgan Kaufmann, 2000.
12. E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proc. IEEE Workshop on Neural Networks for Signal Processing, pages 276–285, 1997.
13. J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report 98-14, Microsoft Research, April 1998.
14. G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998.
15. N. Syed, H. Liu, and K. Sung. Incremental learning with support vector machines. In Proc. of the Int. Joint Conf. on Artificial Intelligence (IJCAI), 1999.
16. V. Vapnik. The nature of statistical learning theory. Springer, New York, 1995.
17. V. Vapnik. Statistical learning theory. John Wiley and Sons, Inc., 1998.
Learning to Predict the Leave-One-Out Error of Kernel Based Classifiers

Koji Tsuda¹,³, Gunnar Rätsch¹,², Sebastian Mika¹, and Klaus-Robert Müller¹,²

¹ GMD FIRST, Kekuléstr. 7, 12489 Berlin, Germany
² University of Potsdam, Am Neuen Palais 10, 14469 Potsdam
³ AIST Computational Biology Research Center, 2-41-6 Aomi, Koutou-ku, Tokyo, 135-0064, Japan
{tsuda,raetsch,mika,klaus}@first.gmd.de
Abstract. We propose an algorithm to predict the leave-one-out (LOO) error for kernel based classifiers. To achieve this goal with computational efficiency, we cast the LOO error approximation task into a classification problem. This means that we need to learn a classification of whether or not a given training sample – if left out of the data set – would be misclassified. For this learning task, simple data dependent features are proposed, inspired by geometrical intuition. Our approach allows one to reliably select a good model, as demonstrated in simulations on Support Vector and Linear Programming Machines. Comparisons to existing learning theoretical bounds, e.g. the span bound, are given for various model selection scenarios.
1 Introduction
Numerous methods have been proposed [7,8,3,5,9] for model selection of kernel-based classifiers such as Support Vector Machines (SVMs) [7] and Linear Programming Machines (LPMs) [1]. They all try to find a reasonably good estimate of the generalization error to select the proper hyperparameters. The data dependent LOO error would in principle be ideal for selecting the hyperparameters of learning machines, as it is an (almost) unbiased estimator of the true generalization error [6]. Its computation is, unfortunately, prohibitively slow in most practical cases. There have been several attempts to approximate the leave-one-out error in closed form for SVM classifiers [8,3,5]. For example, a new type of bound was proposed that relies on the span of the SVs and was empirically found to perform best among the learning theoretical bounds [8]. However, such approximations are limited to a specific learning machine, i.e. the SVM, and it seems difficult to provide a useful approximation that is valid for a more general class of classifiers. In this work, we introduce a learning approach for approximating the LOO error of general kernel classifiers such as SVMs or Linear Programming Machines (LPMs). We propose to use geometrical features to cast the leave-one-out error approximation problem for kernel classifiers into a – quickly solvable – classification problem. Thus, our LOO error approximation problem reduces to learning a
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 331–338, 2001.
© Springer-Verlag Berlin Heidelberg 2001
classification of whether or not a training sample, if left out of the data set, will be misclassified. This task is referred to as a meta learning problem because we try to learn about a learning machine. The meta classification task is data rich, as a large number of training patterns can be generated from all sorts of different classification problems. Note that we use features that are meant to reflect the local difficulty or complexity of the meta learning problem across a large set of possible data. In our experiments, we show that our approach works well both for SVMs and LPMs. After reviewing some popular learning theoretical LOO error bounds, we describe the features used as inputs for the meta learning problem and solve it using a classification approach. Subsequently, we perform simulations showing the usefulness of our LOO error approximation scheme in comparison to other bounds, and finally conclude with some remarks.
2 Reviewing SVMs, LPMs and Selected LOO Bounds
When learning with SVMs [7] and LPMs [1], one seeks the coefficients of a linear combination of kernel functions K(x_i, ·), i.e. f_{α,b}(x) = b + \sum_{i=1}^{\ell} \alpha_i K(x_i, x). This is done by solving the following type of optimization problem [4]:

\min_{\alpha, b, \xi \ge 0} \|\alpha\|_P + C \sum_{i=1}^{\ell} \xi_i \quad \text{with} \quad y_i f_{\alpha,b}(x_i) \ge 1 - \xi_i, \quad i = 1, \ldots, \ell,    (1)
where ‖α‖_P is the 2-norm of α in feature space for SVMs and the 1-norm of α (in the coefficient space) for LPMs, respectively. The data and labels are denoted by x_i ∈ R^n and y_i ∈ {+1, −1}, respectively, and C is the regularization parameter. When solving (1), one usually introduces Lagrange multipliers λ_i, i = 1, …, ℓ, for the constraints in (1), which are zero if the constraint is not active. For SVMs they turn out to be equal to α_i; for LPMs there is no such correspondence. We will now review some bounds on the LOO error of SVMs that have been proposed. A more complete presentation can be found in e.g. [2,8]. Let Z be a sample of size ℓ, where each pattern is defined as z_i = (x_i, y_i). Furthermore, define Z^p = {z_i ∈ Z : i ≠ p} and f^p = L(Z^p), i.e. f^p is the decision function obtained when learning with the p-th sample left out of the training set. The LOO error is defined as

loo(Z) = (1/ℓ) Σ_{p=1}^{ℓ} Ψ(−y_p f^p(x_p)),   (2)
where Ψ(z) = 1 for z > 0 and Ψ(z) = 0 otherwise. As −y_p f^p(x_p) is positive only if f^p commits an error on x_p, loo(Z) is the fraction of patterns which are misclassified when they are left out.

Learning to Predict the Leave-One-Out Error of Kernel Based Classifiers   333

Support Vector Count: SVMs have several useful properties that can be exploited for a LOO prediction. The first is that patterns which are not Support Vectors (SVs) do not change the decision surface and are always correctly classified. Therefore, one has to consider only the SVs in the LOO procedure, and the LOO error can easily be bounded by

loo(Z) ≤ #SV / ℓ,   (3)

where #SV is the number of SVs. However, this is a very rough estimate, because not all SVs will be misclassified when they are removed from the training set. For LPMs, (3) also holds if one defines the SVs to be the patterns x_i whose expansion coefficients α_i or corresponding Lagrange multipliers λ_i are nonzero.
Jaakkola-Haussler Bound: In [3], Jaakkola and Haussler proposed a tighter bound than Eq. (3):

loo(Z) ≤ (1/ℓ) Σ_{p=1}^{ℓ} Ψ(α_p K(x_p, x_p) − y_p f(x_p)),   (4)
where α_i are the SVM weight coefficients and K is the kernel function.

Span Predictions: Recently, a sophisticated way of predicting the LOO error using the "span" of support vectors has been proposed [8]. Under the assumption that the set of SVs does not change during the LOO procedure, the LOO error can be exactly rewritten as

loo(Z) = (1/ℓ) Σ_{p∈SV} Ψ(−y_p f(x_p) + α_p S_p²),   (5)

where SV denotes the set of support vectors and S_p is a geometric value called the "span" of a support vector [8]. Unfortunately, in practice this assumption is not always satisfied; experimentally, however, the approximation has been shown to work very well [8]. Computing the span requires solving an optimization problem, which can be very expensive if the number of SVs is large.
3
Meta Learning: Predicting the Generalization Error from Empirical Data
We will now outline a learning framework to predict the LOO error. The LOO error loo(Z) depends on the data set Z and the parameters θ of the learning machine. We assume that the LOO errors can be measured for several data sets and various combinations (Z_1, θ_1), …, (Z_m, θ_m). Then, a meta-learning machine is trained to predict the LOO error on unseen data based on appropriate features extracted from (Z, θ).

Predicting the LOO Error as a Classification Problem: Recall that the LOO error can be represented by the individual LOO results r(z_p):

loo(Z) = (1/ℓ) Σ_{p=1}^{ℓ} Ψ(−y_p f^p(x_p)) =: (1/ℓ) Σ_{p=1}^{ℓ} Ψ(−r(z_p)),   (6)
334
Koji Tsuda et al.
[Fig. 1 diagram: the left-out sample (z_p, α_p) and its neighbors (z_1, α_1), (z_2, α_2), (z_3, α_3) are fed to a Feature Extractor, whose output the Meta Learning Machine maps to an estimate of the LOO result r(z_p).]
Fig. 1. Learning scheme for predicting the LOO result.
[Fig. 2 illustration: two panels, "With Neighbor" and "Without Neighbor", each showing the decision boundary before and after the LOO step.]
Fig. 2. Consider the case with a non-SV near the SV of interest, as in the left panel. In the leave-one-out procedure, the boundary is re-estimated without this SV, but the non-SV takes its part and the boundary does not change significantly. The LOO boundary could show a large difference if there were no close non-SV neighbor, as in the right panel.
where r(z_p) = sgn(y_p f^p(x_p)). So, to predict the LOO error, it is sufficient to predict the result of each LOO procedure, i.e. whether or not an error will be made if a certain pattern is left out. This meta learning problem is a binary classification problem. For kernel classifiers, the learning scheme can be designed as in Fig. 1: here, a coefficient α_i is attached to every training sample x_i. Features are extracted for the left-out sample (x_p, α_p) as well as for the neighboring samples. The neighbors are included since they are likely to affect the LOO result, as shown in Fig. 2. In taking the neighbors, we do not care whether they are support vectors or not. So, the features also include information from non-support vectors, whereas the span bound (5) is derived from the support vectors only.

Features for Meta Classification: To include the local geometry around a left-out sample, the features for LOO prediction are extracted as follows. For the left-out sample x_p, we use

– Weight α_p, dual weight λ_p (LPM only), margin f(x_p)/‖w‖_2.

Additionally, we calculate the following quantities from the 3 nearest neighbors x_k of x_p:

– Distance in input space ‖x_k − x_p‖/D_i,
– Distance in feature space ‖Φ(x_k) − Φ(x_p)‖/D_f,
– Weight α_k, dual weight λ_k (LPM only), margin f(x_k)/‖w‖_2,

where D_i and D_f are the maximum distances between training samples in the input space and the feature space, respectively. All these features of x_p form a vector v_p, which is used together with the label r(z_p) to learn the LOO error prediction {v_p} → {r(z_p)} for all ℓ patterns. In this work we built the meta classifier as a linear programming machine with a polynomial kernel of degree 2, which employs 1-norm regularization (in feature space) leading to sparse weight vectors. This turns out to be beneficial, as the (meta) LPM automatically selects the relevant features.
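A sketch of this feature extraction in numpy is given below. It is a hypothetical implementation of the quantities listed above (the dual weights λ are omitted and the function signature is our own); the feature-space distance is obtained from the kernel matrix via ‖Φ(x_k) − Φ(x_p)‖² = K_kk − 2K_kp + K_pp:

```python
import numpy as np

def meta_features(p, X, alpha, f, K, w_norm, n_neighbors=3):
    """Feature vector v_p for the left-out sample x_p (dual weights omitted)."""
    # pairwise input-space distances and their maximum D_i
    diff = X[:, None, :] - X[None, :, :]
    dist_in = np.sqrt((diff ** 2).sum(-1))
    D_i = dist_in.max()
    # feature-space distances via the kernel trick, and their maximum D_f
    dK = np.diag(K)
    dist_feat = np.sqrt(np.maximum(dK[:, None] - 2 * K + dK[None, :], 0.0))
    D_f = dist_feat.max()
    # indices of the nearest neighbors of x_p (excluding x_p itself)
    order = np.argsort(dist_in[p])
    nbrs = order[order != p][:n_neighbors]
    feats = [alpha[p], f[p] / w_norm]                    # left-out sample
    for k in nbrs:                                       # its neighbors
        feats += [dist_in[p, k] / D_i, dist_feat[p, k] / D_f,
                  alpha[k], f[k] / w_norm]
    return np.array(feats)
```

After training, α, the decision values f and ‖w‖_2 are all available, so extracting v_p for every p is cheap compared to an actual LOO run.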
4
Experiments
The following experiments are designed to answer two questions: (1) Does the LOO error predictor learn from a given data set to generalize well on unseen data sets? (2) Is the prediction good enough for reliable model selection?

4.1
Two Class Benchmarks
In our study, we considered three data sets¹: twonorm, ringnorm and heart. Here, twonorm and ringnorm are quite similar, because the input dimensionality is 20 and the number of training samples is 400 for both data sets. But heart is a quite different data set, where the input dimensionality is 13 and the number of training samples is only 170. For the evaluation of our meta classifier we use the following experimental setup: for each data set we considered ten realizations (train/test splits, used for averaging and obtaining error bars). On each realization, we trained SVMs and LPMs for wide ranges of the regularization constant C and the (RBF) kernel parameter σ, i.e. K(x, y) = exp(−‖x − y‖²/σ²). For each training sample we extracted the features described in Section 3. These features and the corresponding labels (which have been computed by the actual LOO procedure) are used for training and testing our classifier. We learned from two data sets and tested on the third. To evaluate the performance, the model selection error is used: first, the kernel parameter σ and regularization constant C are selected where the predicted LOO error is minimized. The model selection error is then defined as the classification error on the test set at the chosen parameters. The results for SVM and LPM are shown in Figs. 3 and 4, respectively. Results on the span bound and JH bound are not available for LPMs, as the respective bounds simply do not exist. The experiments show that in most cases our LOO predictor performs almost as well as the actual (highly expensive) LOO calculation. Compared to the bounds, we observe that both methods achieve similar performance; note, however, that our method can also be applied to LPMs. Comparing the three cases in Fig. 4, our method performs slightly worse when heart
¹ The data sets incl. training/test splits and the LOO results can be obtained at http://ida.first.gmd.de/~raetsch.
[Fig. 3 plots: three panels — Ringnorm, Heart ↓ Twonorm; Twonorm, Heart ↓ Ringnorm; Twonorm, Ringnorm ↓ Heart.]

Fig. 3. Model selection errors in SVM. The labels denote — OPT: optimal choice based on the test error, LOO: actual leave-one-out calculation, Span: span bound, JH: Jaakkola-Haussler bound, SVC: Support Vector Count, Pred: our method. Two data sets are used for training and the test error is assessed on the third one.
is used for testing in comparison with the other two cases. This shows the tendency that LOO error prediction works well if data with similar characteristics is contained in the training set. It also indicates that the statistics of the feature set still vary considerably across data sets. However, it is still surprising that our simple features can show such good generalization performance on benchmark problems.

4.2
A Multiclass Example
As one application of our approach, we consider multi-class problems. In solving c-class problems it is common, e.g. for SVMs, to use c single (two-class) classifiers, each of which is trained to discriminate one class from all the other classes. Here, the hyperparameters of each SVM have to be properly set by some model selection process. Clearly, it takes prohibitively long to perform leave-one-out procedures in all SVMs for all possible hyperparameter settings. To cope with this problem, the leave-one-out procedure is performed with respect to only one of the c classification problems and our meta classifier is then trained based on this result. Then, the hyperparameters of the other SVMs can be efficiently selected according to the LOO error predictions from the meta classifier. We performed an experiment with the 3dobj data set² with 8 classes and 640 samples (i.e. 80 samples per class). The task here is to choose a proper value of the kernel width σ from a pre-specified set of 11 values. The sample is randomly divided into 320 training and 320 test samples for obtaining error bars on our results. One classification problem is chosen for training and the remaining ones are used for testing. For each σ in Σ = {0.4, 0.6, …, 2.4}, we computed the LOO error predictions using the meta classifier.
² This dataset will be added to the IDA website.
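The selection step just described can be sketched as follows: the meta classifier's per-sample predictions r̂(z_p) ∈ {+1, −1} are averaged into a LOO error estimate for each σ, and the σ with the smallest estimate wins. Here, meta_clf and the per-σ feature matrices are hypothetical stand-ins for whatever meta classifier and feature extraction are actually used:

```python
import numpy as np

def predicted_loo_error(meta_clf, V):
    """Average the meta predictions into a LOO error estimate, cf. Eq. (6).
    V holds one meta feature vector per training sample; meta_clf.predict
    returns +1 (correct when left out) or -1 (LOO error)."""
    r_hat = meta_clf.predict(V)
    return np.mean(r_hat == -1)

def select_kernel_width(meta_clf, features_by_sigma):
    """Pick the sigma whose predicted LOO error is smallest.
    features_by_sigma: dict mapping sigma -> meta feature matrix."""
    return min(features_by_sigma,
               key=lambda s: predicted_loo_error(meta_clf, features_by_sigma[s]))
```

The point of the construction is that predicted_loo_error costs one forward pass of the meta classifier per candidate σ, instead of one full LOO run per candidate σ.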
[Fig. 4 plots: three panels — Ringnorm, Heart ↓ Twonorm; Twonorm, Heart ↓ Ringnorm; Twonorm, Ringnorm ↓ Heart.]

Fig. 4. Model selection errors in LPM.

[Fig. 5 plots: two panels — SVM and LPM.]

Fig. 5. Model selection errors in the multiclass problem.
Figure 5 shows the model selection errors of SVMs and LPMs. Here, our method performed as well as the actual LOO calculation for both SVMs and LPMs. The task of model selection in multi-class problems appears very suitable for our method, because the data sets for training and testing can be expected to have similar statistical properties.
5
Conclusion
Training a learning machine that learns about the generalization behavior of a set of learning machines is an appealing idea and introduces a meta level of reasoning. Our goal in this work was to obtain an effective meta algorithm for predicting the LOO error from past experience, i.e. from a variety of data sets. Casting the meta learning problem as a specific classification task allowed us to achieve an accurate and fast empirical estimate of the LOO error of SVMs and LPMs. The crucial point was to use simple geometric features for classifying whether a given training sample would be misclassified if left out. Given a reliable LOO error estimate, it can easily be used for model selection.
Careful simulations (using approximately 1.5 CPU years on 800 MHz Pentium III machines, mostly for obtaining the actual LOO error) show that our meta learning framework compares favorably with the conventional bounds. Apparently, our heuristic, geometrically motivated features generalize well across different data sets. We speculate that these features could provide new ways to improve bounds, namely by integrating particularly meaningful features from our meta learning problem into new learning-theoretic bounds. Future research will be dedicated to a further exploration of the meta learning idea for other learning machines as well, and to gaining a better learning-theoretic understanding of our findings.
Acknowledgments. We thank Jason Weston, Olivier Chapelle and Bernhard Schölkopf for valuable discussions. This work was partially funded by DFG under contract JA 379/71, 91, MU 987/11.
References

1. K.P. Bennett and O.L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.
2. O. Chapelle and V.N. Vapnik. Choosing kernel parameters for support vector machines. Personal communication, March 2000.
3. T.S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics, 1999.
4. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 2001. In press.
5. M. Opper and O. Winther. Gaussian processes and SVM: Mean field and leave-one-out. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 311–326, Cambridge, MA, 2000. MIT Press.
6. V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin, 1982.
7. V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
8. V.N. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9):2013–2036, September 2000.
9. G. Wahba, Y. Lin, and H. Zhang. Generalized approximate cross-validation for support vector machines: another way to look at margin-like quantities. Technical Report TR-1006, Dept. of Statistics, University of Wisconsin, April 1999.
Sparse Kernel Regressors Volker Roth University of Bonn, Department of Computer Science III, Roemerstr. 164, D53117 Bonn, Germany
[email protected]

Abstract. Sparse kernel regressors have become popular by applying the support vector method to regression problems. Although this approach has been shown to exhibit excellent generalization properties in many experiments, it suffers from several drawbacks: the absence of probabilistic outputs, the restriction to Mercer kernels, and the steep growth of the number of support vectors with increasing size of the training set. In this paper we present a new class of kernel regressors that effectively overcomes the above problems. We call this new approach generalized LASSO regression. It has a clear probabilistic interpretation, produces extremely sparse solutions, can handle learning sets that are corrupted by outliers, and is capable of dealing with large-scale problems.
1
Introduction
The problem of regression analysis is one of the fundamental problems within the field of supervised machine learning. It can be stated as estimating a real-valued function, given a sample of noisy observations. The data are obtained as i.i.d. pairs of feature vectors {x_i}_{i=1}^N and corresponding targets {y_i}_{i=1}^N, drawn from an unknown joint distribution p(x, y). Viewed as a function of x, the conditional expectation of y given x is called the regression function f_r(x) = E[y|x] = ∫_{−∞}^{+∞} y p(y|x) dy. A very successful approach to this problem is the support vector machine (SVM). It models the regression function by way of kernel functions k(x, x_i):¹

f_r(x) = Σ_{i=1}^{N} k(x, x_i) α_i =: Kα.
However, SV regression bears some disadvantages: (i) The predictions cannot be interpreted in a probabilistic way. (ii) The solutions are usually not very sparse, and the number of support vectors is strongly correlated with the sample size. (iii) The kernel function must satisfy Mercer’s condition. A Bayesian approach to kernel regression that overcomes these drawbacks was presented in [8]. This model is referred to as the relevance vector machine (RVM). One of its most outstanding features is the extreme sparsity of the solutions. Once we have 1
¹ For the sake of simplicity we have dropped the constant term α_0 throughout this paper. If the kernel satisfies Mercer's condition, this can be justified either if the kernel has an implicit intercept, i.e. if k(0, 0) > 0, or if the input vectors are augmented by an additional entry 1 (e.g. for polynomial kernels). If Mercer's condition is violated, which may be possible in the RVM approach, the kernel matrix K itself can be augmented by an additional column of ones.
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 339–346, 2001. © Springer-Verlag Berlin Heidelberg 2001
successfully trained a regression function, this sparsity allows us to make predictions for new observations in a highly efficient way. Concerning the training phase, however, the original RVM algorithm suffers from severe computational problems. In this paper we present a class of kernel regressors that adopt the conceptual ideas of the RVM and additionally overcome its computational problems. Moreover, our model can easily be extended to robust loss functions. This in turn overcomes the sensitivity of the RVM to outliers in the data. We propose a highly efficient training algorithm that directly exploits the sparsity of the solutions. Performance studies for both synthetic and real-world benchmark datasets are presented, which effectively demonstrate the advantages of our model.
2
Sparse Bayesian Kernel Regression
Applying a Bayesian method requires us to specify a set of probabilistic models. A member of this set is called a hypothesis H_α, which has a prior probability P(H_α). The likelihood of H_α is P(D | H_α), where D represents the data. For regression problems, each H_α corresponds to a regression function f_α. Under the assumption that the targets y are generated by corrupting the values of f_α with additive Gaussian noise of variance σ², the likelihood of H_α is

Π_i P(y_i | x_i, H_α) = Π_i (1/√(2πσ²)) exp(−(y_i − f_α(x_i))²/(2σ²)).   (1)

The key concept of the RVM is the use of automatic relevance determination (ARD) priors over the expansion coefficients, of the following form:

P(α | ϑ) = N(0, Σ_α) = Π_{i=1}^{N} N(0, ϑ_i⁻¹).   (2)

This prior model leads to a posterior of the form

P(α | y, ϑ, σ²) = (2π)^{−N/2} |A|^{1/2} exp(−½ (α − ᾱ)ᵀ A (α − ᾱ)),   (3)

with (inverse) covariance matrix A = Σ_α⁻¹ + (1/σ²) Kᵀ K and mean ᾱ = (1/σ²) A⁻¹ Kᵀ y. From the form of (1) and (2) it is clear that the posterior mean vector minimizes the quadratic form

M(α) = ‖y − Kα‖² + σ² αᵀ Σ_α⁻¹ α = ‖y − Kα‖² + Σ_{i=1}^{N} ϑ_i α_i²,   (4)

where we have redefined ϑ := σ² ϑ for the sake of simplicity. Given the above class of ARD models, there are now different inference strategies:

– In [8] the Relevance Vector Machine (RVM) was proposed as a (partially) Bayesian strategy: integrating out the expansion coefficients in the posterior distribution, one obtains an analytical expression for the marginal likelihood P(y | ϑ, σ²), or evidence, for the hyperparameters. For ideal Bayesian inference one should define hyperpriors over ϑ and σ and integrate out these parameters. Since there is no closed-form solution for this marginalization, however, it is common to use a Gaussian approximation of the posterior mode. The most probable parameters ϑ_MP are
chosen by maximizing P(y | ϑ, σ²). Given a current estimate of the α vector, the parameters ϑ_k are derived as

(ϑ_k)^new = (1 − ϑ_k (A⁻¹)_kk) / ᾱ_k².   (5)

These values are then substituted into the posterior (3) in order to get a new estimate of the expansion coefficients:

α^new = (Kᵀ K + σ² diag{ϑ})⁻¹ Kᵀ y.   (6)

– The key idea of Adaptive Ridge (AdR) regression is to select the parameters ϑ_j by minimizing (4). Direct minimization, however, would obviously shrink all parameters to zero. This is clearly not satisfactory, and can be avoided by applying a constraint of the form

(1/N) Σ_{i=1}^{N} 1/ϑ_i = 1/λ,   ϑ_i > 0,   (7)

where λ is a predefined value (cf. [5]). This constraint connects the individual variances of the ARD prior by requiring that their mean variance be proportional to 1/λ. The idea behind (7) is to start with a ridge-type estimate (ϑ_i = λ for all i) and then introduce a method of automatically balancing the penalization on each variable. Before going into details, the reader should note that both approaches are conceptually equivalent in the sense that they share the same idea of using an ARD prior, and they both employ some pragmatic procedure for deriving the optimal prior parameters ϑ. For a detailed analysis of AdR regression, it is useful to introduce a Lagrangian formalism. The Lagrangian for minimizing (4) under the constraint (7) reads

L = ‖y − Kα‖² + Σ_{i=1}^{N} ϑ_i α_i² + µ (Σ_{i=1}^{N} 1/ϑ_i − N/λ).   (8)

For the optimal solution, the derivatives of L with respect to both the primal and dual variables must vanish, i.e. ∂L/∂α_k = 0, ∂L/∂ϑ_k = 0, ∂L/∂µ = 0. It follows that for given parameters ϑ the optimizing coefficients α_i are obtained as

α^new = (Kᵀ K + diag{ϑ})⁻¹ Kᵀ y,   (9)

and we find (ϑ_k)^new = √µ / |α_k|. Together with the stationarity conditions, the optimal parameters are derived as

(ϑ_k)^new = (λ/N) Σ_{i=1}^{N} |α_i| / |α_k|.   (10)

During iterated application of (9) and (10), it turns out that some parameters ϑ_j approach infinity, which means that the variance of the corresponding priors p(α_j | ϑ_j) becomes zero, and in turn the posterior p(α_j | y, ϑ, σ²) becomes infinitely peaked at zero.
As a consequence, the coefficients α_j are shrunk to zero, and the corresponding variables (the columns of the kernel matrix K) are removed from the model. Note that besides the conceptual equivalence between AdR regression and the RVM, both approaches are also technically very similar in the following sense: they share the same update equations for the coefficients α_i, (9) and (6), and the hyperparameters ϑ_k in (10) and
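The alternating application of (9) and (10) fits in a few lines of numpy. This is a sketch, not the paper's implementation: the small eps guards the division once coefficients shrink to zero, and the fixed iteration count stands in for a proper convergence test:

```python
import numpy as np

def adr_fit(K, y, lam=1.0, n_iter=100, eps=1e-9):
    """Adaptive ridge: alternate the ridge solve (9) with the penalty
    update (10) until some theta_j diverge and the corresponding
    alpha_j shrink to zero."""
    N = K.shape[1]
    theta = np.full(N, lam)                    # start from a plain ridge estimate
    alpha = np.zeros(N)
    for _ in range(n_iter):
        alpha = np.linalg.solve(K.T @ K + np.diag(theta), K.T @ y)   # Eq. (9)
        a = np.abs(alpha)
        theta = (lam / N) * a.sum() / np.maximum(a, eps)             # Eq. (10)
    alpha[np.abs(alpha) < 1e-6] = 0.0          # prune effectively-zero weights
    return alpha
```

With K equal to the identity the iteration reduces to a soft-thresholding-like shrinkage, which makes the resulting sparsity easy to verify.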
[Fig. 1 plots: three panels fitting the noisy sinc function.]

Fig. 1. Fitting the noisy sinc function. First left: AdR regression, first right: RVM. The original sinc function is depicted by the dashed curve, the relevance vectors by the black circles. For comparison, the SVM solution is also plotted; its insensitive tube is depicted by the two dotted curves around the fit.
(5) are inversely proportional to the magnitude of the corresponding weights α_k. It is thus not surprising that both methods produce similar regression fits; see Fig. 1. We have chosen the popular example of fitting the noisy sinc function. For comparison, we have also trained an SVM. This simple yet intuitive experiment nicely demonstrates one of the most notable differences between the ARD models and the SVM: the former produce solutions which are usually much sparser than the SVM solutions, sometimes by several orders of magnitude. This immediately illustrates two advantages of the ARD models: (i) the extreme sparsity may be exploited to develop highly efficient training algorithms; (ii) once we have a trained model, the prediction problem for new samples can be solved extremely fast.
3

An Efficient Algorithm for AdR Regression

Both the above iterative AdR algorithm and the original RVM algorithm share two main drawbacks: (i) convergence is rather slow, so many iterations are needed; (ii) solving Eq. (9) or (6) for the new α_i means solving a system of N linear equations in N variables, which is very time consuming if N becomes large. In the case of AdR regression, the latter problem can be overcome by applying approximate conjugate gradient methods. Because of (i), however, the overall procedure still remains rather time consuming. For the original RVM, even this speed-up is not applicable, since here the update equations for the hyperparameters require us to explicitly invert the matrix (Kᵀ K + diag{ϑ}) anyway. However, these computational problems can be overcome by exploiting the equivalence of AdR regression and the so-called Least Absolute Shrinkage and Selection Operator (LASSO), see [5,7]. Since space precludes a detailed derivation, we only note that it can be derived directly from the Lagrangian formalism introduced in the last section. Minimizing (8) turns out to be equivalent to solving the LASSO problem, which can be interpreted as ℓ₁-penalized least-squares regression:
minimize Σ_{i=1}^{N} (y_i − (Kα)_i)²   subject to   ‖α‖₁ ≤ κ.   (11)
It is worth noting that the equivalence holds for any differentiable loss function. In particular, this allows us to employ robust loss functions, which make the estimation process less sensitive to outliers in the data. We will return to this point in Section 4. The real payoff of reformulating AdR regression in terms of the LASSO is that for the latter problem there exist highly efficient subset algorithms that directly make use of the sparsity of the solutions. Such an algorithm was originally introduced in [6] for linear least-squares problems. In the following it will be generalized both to nonlinear kernel models and to general loss functions. Denoting a differentiable loss function by L, the Lagrangian for the general LASSO problem can be written as

L(α, µ) = Σ_{i=1}^{N} L(y_i − (Kα)_i) − µ (κ − Σ_{j=1}^{N} |α_j|).   (12)
According to the Kuhn-Tucker theorem, the partial derivatives of L with respect to α_i and µ have to vanish. Introducing the function ω(t) = ∂L(t)/(t ∂t) and the diagonal matrix Ω(α) = diag{ω([Kα − y]_i)}, the derivative of L w.r.t. α reads (cf. [3,6])

∇_α L = −Kᵀ Ω(α) r + µ v = 0,   with   v_i = sign(α_i) if α_i ≠ 0, and v_i = a ∈ [−1, 1] if α_i = 0.   (13)
In the above equation, r = y − Kα denotes the vector of residuals. For the derivation it is useful to introduce some notation: from the form of v it follows that ‖v‖_∞ = 1, which implies µ = ‖Kᵀ Ω r̂‖_∞. To deal with the sparsity of the solutions, it is useful to introduce the permutation matrix P. It collects the nonzero coefficients of α in the first σ components, i.e. α = Pᵀ (α_σ; 0). Furthermore, we denote by θ_σ a sign vector, θ_σ = sign(α_σ). An efficient subset selection algorithm, which heavily draws on [6], can now be outlined as follows: given the current estimate α, the key idea is to calculate a new search direction h = Pᵀ (h_σ; 0) locally around α. This local problem reads

min_{h_σ}  Σ_{i=1}^{N} L([y]_i − [K_σ(α_σ + h_σ)]_i)   s.t.   θ_σᵀ (α_σ + h_σ) ≤ κ.   (14)
For a quadratic loss function this problem can be solved analytically; otherwise it defines a simple nonlinear optimization problem in σ variables. The problem is simple because (i) it is usually a low-dimensional problem, σ ≪ N; (ii) for a wide class of robust loss functions it defines a convex optimization problem; (iii) either the constraint is inactive, or the solution lies on the constraint boundary. In the latter case (if the unconstrained solution is not feasible) we have to handle only one simple linear equality constraint, θ_σᵀ h_σ = κ̃. (iv) Our experiments show that the number of nonzero coefficients σ is usually not correlated with the sample size N. It follows that even large-scale problems can be solved efficiently. The iteration is started from α = 0 by choosing an initial index s to insert into σ and solving the resulting one-variable problem. With the concept of sign feasibility (cf. [1]), the algorithm proceeds as follows:
344
Volker Roth
Check if α† := α + h is sign feasible, i.e. if sign(α†_σ) = θ_σ. Otherwise:
– (A1) Move to the first new zero component in direction h, i.e. find the smallest γ, 0 < γ < 1, and corresponding k ∈ σ such that 0 = α_k + γh_k, and set α = α + γh.
– (A2) There are two possibilities:
1. Set θ_k = −θ_k and recompute h. If (α + h) is sign feasible for the revised θ_σ, set α† = α + h and proceed to the next stage of the algorithm.
2. Otherwise update σ by deleting k, resetting α_k and θ_k accordingly, and recompute h for the revised problem.
– (A3) Iterate until a sign feasible α† is obtained.

Once sign feasibility is obtained, we can test optimality by verifying (13): calculate

v† = Kᵀ Ω(α†) r† / ‖Kᵀ Ω(α†) r†‖_∞ = Pᵀ (v₁†; v₂†).

By construction, (v₁†)_i = θ_i for i ≤ σ, and if −1 ≤ (v₂†)_i ≤ 1 for 1 ≤ i ≤ N − σ, then α† is the desired solution. Otherwise, one proceeds as follows:
– (B1) Determine the most violated condition, i.e. find the index s such that (v₂†)_s has maximal absolute value.
– (B2) Update σ by adding s to it, update α†_σ by appending a zero as its last element, and θ_σ by appending sign((v₂†)_s).
– (B3) Set α = α†, compute a new direction h by solving (14), and iterate.
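For the quadratic loss, where Ω is a constant multiple of the identity and cancels in the normalization, the optimality test and the violation search (B1) can be sketched as follows (the function name and the explicit support argument are our own conventions):

```python
import numpy as np

def most_violated(K, y, alpha, support, tol=1e-8):
    """Check condition (13) for quadratic loss; return the index s of the
    most violated condition (step B1), or None if alpha is optimal."""
    r = y - K @ alpha                         # residual vector
    g = K.T @ r                               # K^T Omega r, up to a constant
    if len(support) == 0:
        return int(np.argmax(np.abs(g)))      # initial variable to insert
    mu = np.max(np.abs(g[support]))           # = ||K^T Omega r||_inf at a KKT point
    v = g / mu                                # so that (v_1)_i = theta_i on the support
    off = np.setdiff1d(np.arange(len(alpha)), support)
    if off.size == 0 or np.all(np.abs(v[off]) <= 1.0 + tol):
        return None                           # alpha is the desired solution
    return int(off[np.argmax(np.abs(v[off]))])
```

Each call costs one matrix-vector product with the full kernel matrix, while the inner solves of (14) only touch the σ active columns.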
4
Experiments
In a first example, the prediction performance of LASSO regression is demonstrated on Friedman's benchmark functions [4]. Since only relatively small learning sets are considered here, we postpone a detailed analysis of computational costs to the experiments below. The results are summarized in Table 1.

Table 1. Results for Friedman's functions. Mean prediction error (100 randomly generated 240/1000 training/test splits) and #(support/relevance vectors). SVM/RVM results are taken from [8].

Dataset | SVM            | RVM            | LASSO
#1      | 2.92 / 116.6   | 2.80 / 59.4    | 2.84 / 73.5
#2      | 4140 / 110.3   | 3505 / 6.9     | 3808 / 14.2
#3      | 0.0202 / 106.5 | 0.0164 / 11.5  | 0.0192 / 16.4
It should be noted that all three models attain a very similar level of accuracy. Distinct differences, however, occur in the number of support/relevance vectors: the models employing ARD priors produce much sparser solutions than the SVM, in accordance with the results from Fig. 1. As real-world examples, we present results for the "houseprice8L" and "bank32fh" datasets from the DELVE benchmark repository². We compared both the prediction accuracy and the computational costs of the RVM, SVM³ and LASSO for different sample sizes. The results are summarized in Table 2. From the table we conclude that (i) the prediction accuracy of all models is comparable; (ii) the ARD models are sparser than the SVM by 1–2 orders of magnitude; (iii) the RVM has
² The datasets are available via http://www.cs.toronto.edu/~delve/delve.html
³ We used the SVMTorch V 3.07 implementation, see [2].
Sparse Kernel Regressors
345
severe computational problems for large training sets, (iv) the LASSO combines the advantages of efficiently handling large training sets and producing extremely sparse solutions, see also figure 2. Concerning the training times, the reader should notice that we are comparing the highly tuned SVMTorch optimization package, [2], with a rather simple LASSO implementation, which we consider to yet possess ample opportunities for further optimization. Table 2. Results for the “houseprice8L” and “bank32fh” datasets from the DELVE repository. In all experiments RBF kernels are used. The times are measured on a 500 MHz PC. The last 3 columns show the time in seconds for predicting the function value of 4000 test examples. sample
houseprice8L (MSE ·10³):
sample | MSE (Rvm/Svm/Lasso) | #SV/RV (Rvm/Svm/Lasso) | t_learn [s] (Rvm/Svm/Lasso) | t_test [s] (Rvm/Svm/Lasso)
1000   | 1099 / 1062 / 1075  | 33 / 597 / 61          | 4.2·10³ / 33 / 26           | 0.1 / 1.4 / 0.2
2000   | 1048 / 1022 / 1054  | 36 / 1307 / 63         | 3.5·10⁴ / 101 / 72          | 0.1 / 3.5 / 0.2
4000   | – / 1012 / 1024     | – / 2592 / 69          | – / 428 / 312               | – / 8 / 0.2

bank32fh (MSE ·10⁻³):
2000   | 7.41 / 7.82 / 7.39  | 14 / 1638 / 22         | 3·10⁴ / 15 / 24             | 0.07 / 6 / 0.1
4000   | – / 7.75 / 7.49     | – / 3402 / 23          | – / 83 / 102                | – / 13 / 0.1

[Fig. 2 plots: training times (solid) and prediction times (dashed) vs. sample size for SVM, RVM and LASSO on the house8L dataset.]
Fig. 2. Computation times for the “house8L” dataset. Solid lines depict training times, dashed lines depict prediction times for a test set of size 4000. In the training phase, both SVM and LASSO are clearly advantageous over the RVM. For predictions, the ARD models outperform the SVM drastically. Note that the prediction time solely depends on the number of nonzero expansion coefficients, which for ARD models roughly remains constant with increasing size of the training set.
In the experiments presented so far we exclusively used quadratic loss functions. It is worth noting that within the whole DELVE archive we could not find a single problem for which a robust loss function significantly improved the accuracy. This, however, only means that the problems considered are too "well-behaved" in the sense that they obviously contain no, or only very few, outliers. Here, we instead present an intuitive artificial example with a high percentage of outliers; applications to relevant real-world problems will be the subject of future work. We return to the problem of fitting the noisy sinc function, this time with an additional 20% outliers drawn from a uniform distribution. The situation for both standard and robust LASSO is depicted in Fig. 3. The non-robust LASSO approach is very sensitive to the outliers, which results from the quadratic growth of the loss function. The robust version, employing a loss function of Huber's type (see e.g. [3]), overcomes this drawback by penalizing distant outliers only linearly.
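Only the weight function ω(t) = L′(t)/t that enters Ω(α) in (13) changes between the two variants. A sketch under the convention L(t) = t² inside the transition point δ (the value of δ here is illustrative):

```python
import numpy as np

def omega_quadratic(t):
    """L(t) = t^2  =>  omega(t) = L'(t)/t = 2, a constant:
    every residual, including a distant outlier, gets full weight."""
    return np.full_like(np.asarray(t, dtype=float), 2.0)

def omega_huber(t, delta=1.0):
    """Huber's loss: L(t) = t^2 for |t| <= delta, 2*delta*|t| - delta^2 beyond.
    Then omega(t) = 2 inside and 2*delta/|t| outside, so distant
    outliers are down-weighted instead of dominating the fit."""
    a = np.abs(np.asarray(t, dtype=float))
    return np.where(a <= delta, 2.0, 2.0 * delta / np.maximum(a, 1e-12))
```

Because ω is bounded and decays like 1/|t| beyond δ, an outlier's contribution to Kᵀ Ω r grows only linearly with its distance, which is exactly the behavior described above.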
346
Volker Roth

[Figure 3 plots: two panels; only axis tick labels (vertical scale −5 to 5, horizontal −10 to 10) survived extraction.]
Fig. 3. LASSO results for fitting the noisy sinc function with 20% outliers. Left: quadratic loss; right: Huber's robust loss (region of quadratic growth depicted by the two dotted curves).
5 Discussion
Sparsity is an important feature of kernel regression models, since it simultaneously allows us to efficiently learn a regression function and to efficiently predict function values. For the SVM, highly tuned training algorithms have been developed in recent years. However, the SVM approach still suffers from the steep growth of the number of support vectors with increasing training sets. Experiments in this paper demonstrate that ARD models like the RVM and the LASSO produce solutions that are much sparser than the SVM solutions. Moreover, the number of relevance vectors is almost uncorrelated with the size of the training sample. Within the class of ARD models, however, the original RVM algorithm suffers from severe computational problems during the learning phase. We have demonstrated that the "kernelized" LASSO estimator overcomes this drawback while adopting the advantageous properties of the RVM. In addition, we have shown that robust LASSO variants employing loss functions of Huber's type are advantageous for situations in which the learning sample is corrupted by outliers.

Acknowledgments. The author would like to thank J. Buhmann and V. Steinhage for helpful discussions. Financial support from the German Research Council (DFG) is gratefully acknowledged.
References

1. D.I. Clark and M.R. Osborne. On linear restricted and interval least-squares problems. IMA Journal of Numerical Analysis, 8:23–36, 1988.
2. Ronan Collobert and Samy Bengio. Support vector machines for large-scale regression problems. Technical Report IDIAP-RR 00-17, IDIAP, Martigny, Switzerland, 2000.
3. J.A. Fessler. Grouped coordinate descent algorithms for robust edge-preserving image restoration. In SPIE, Image Reconstruction and Restoration II, volume 3170, pages 184–194, 1997.
4. J.H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–82, 1991.
5. Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In L. Niklasson, M. Bodén, and T. Ziemke, editors, ICANN'98, pages 201–206. Springer, 1998.
6. M.R. Osborne, B. Presnell, and B.A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–404, July 2000.
7. R.J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B 58(1):267–288, 1996.
8. M.E. Tipping. The relevance vector machine. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Neural Information Processing Systems, volume 12, pages 652–658. MIT Press, 1999.
Learning on Graphs in the Game of Go

Thore Graepel, Mike Goutrié, Marco Krüger, and Ralf Herbrich
Computer Science Department, Technical University of Berlin, Berlin, Germany
{guru,mikepg,grisu,ralfh}@cs.tu-berlin.de
Abstract. We consider the game of Go from the point of view of machine learning and as a well-defined domain for learning on graph representations. We discuss the representation of both board positions and candidate moves and introduce the common fate graph (CFG) as an adequate representation of board positions for learning. Single candidate moves are represented as feature vectors with features given by subgraphs relative to the given move in the CFG. Using this representation we train a support vector machine (SVM) and a kernel perceptron to discriminate good moves from bad moves on a collection of life-and-death problems and on 9×9 game records. We thus obtain kernel machines that solve Go problems and play 9×9 Go.
1 Introduction
Go (Chinese: Wei-Qi, Korean: Baduk) is an ancient oriental game about surrounding territory that originated in China over 4000 years ago. Its complexity by far exceeds that of chess, an observation well supported by the success of the computer program Deep Blue. In contrast, numerous attempts at reproducing such a result for the game of Go have been unsuccessful. As a consequence, Go appears to be an interesting testing ground for machine learning [2]. In particular, we consider the problem of representation, because an adequate representation of the board position is an essential prerequisite for the application of machine learning to the game of Go.

A particularly elegant statement of the rules of Go is due to Tromp and Taylor¹ and is only slightly paraphrased for our purpose: (1) Go is played on an N × N square grid of points, by two players called Black and White. (2) Each point on the grid may be coloured black, white, or empty. A point P is said to reach a colour C if there exists a path of (vertically or horizontally) adjacent points of P's colour from P to a point of colour C. Clearing a colour is the process of emptying all points of that colour that do not reach empty. (3) Starting with an empty grid, the players alternate turns, starting with Black. A turn is either a pass; or a move that does not repeat an earlier grid colouring. A move consists of colouring an empty point one's own colour; then clearing the

¹ See http://www.cwi.nl/~tromp/go.html for a detailed description.
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 347–352, 2001. © Springer-Verlag Berlin Heidelberg 2001
348
Thore Graepel et al.
opponent colour, and then clearing one's own colour. (4) The game ends after two consecutive passes. A player's score is the number of points of her colour, plus the number of empty points that reach only her colour. The player with the higher score at the end of the game is the winner. Equal scores result in a tie.

While the majority of computer Go programs work in the fashion of rule-based expert systems [4], several attempts have been made to apply machine learning techniques to Go. Two basic learning tasks can be identified: (1) learning an evaluation function for board positions, and (2) learning an evaluation function for moves in given positions. The first task was tackled in the framework of reinforcement learning by Schraudolph and Sejnowski [7], who learned a pointwise evaluation function by the application of temporal difference learning to a multilayer perceptron (MLP). The second task found an application to tsume Go in [6], who used an MLP to find problem-solving moves. All the known approaches suffer from a rather naive representation.

In this paper we introduce a new representation for both board positions and candidate moves based on what we call a common fate graph (CFG). Our CFG representation builds on ideas first presented by Markus Enzensberger in the context of his Go program NeuroGo II². While our discussion is focused on the game of Go, our considerations about representation should be of interest with regard to all those domains where the standard feature vector representation does not adequately capture the structure of the learning problem. A natural problem domain that shares this characteristic is the classification of organic chemical compounds represented as graphs of atoms (nodes) and bonds (edges) [5].
2 Representation
An adequate representation of the learning problem at hand is an essential prerequisite for successful learning. One could even go as far as saying that an adequate representation should render the actual learning task trivial. If the objects to be classified are represented such that the intra-class distances are zero while the inter-class distances are strictly greater than zero, a simple nearest neighbour classifier would be able to solve the learning problem perfectly. More realistically, we aim at finding a representation that captures the structure of board positions by mapping similar positions (in the sense of the learning problem) to similar representations. Another desirable feature of a representation is a reduction in complexity: only those features relevant to the learning problem at hand should be retained.
Common Fate Graph
The value of a given Go position is invariant under rotation and mirroring of the board. Also the rules of the game refer essentially only to the local neighbourhood structure of the game. A board position is thus adequately represented by its full graph representation (FGR), a graph with the

² The Integration of A Priori Knowledge into a Go Playing Neural Network, available via http://www.cgl.ucsf.edu/home/pett/go/Programs/NeuroGoPS.html
Figure 1. Illustration of the feature extraction process: (a) board position (FGR), (b) corresponding common fate graph (CFG), and (c) a selection of extracted relative subgraph features (RSF) w.r.t. the node marked by a gray circle in (a) and (b).
structure of an N × N square grid. Let us define the FGR G_FGR = (P, E) as an undirected connected graph G_FGR ∈ G_uc. The set P = {p_1, ..., p_{N²}} of nodes represents the points on the board. Also, each node p ∈ P has any of three given labels l : P → {black, white, empty}. The symmetric binary edge relation E = {e_1, ..., e_{N_E}} with e_i ∈ {{p, p′} : p, p′ ∈ P} represents vertical or horizontal neighbourhood between points.

However, we observe that black or white points that belong to the same chain will always have a common fate: either all of them remain on the board or all of them are cleared. In any case we can represent them in a single node. We also reduce the number of edges by requiring that any two nodes may be connected by only a single edge representing their neighbourhood relation. The resulting reduced graph representation will be called a common fate graph (CFG) and will serve as the basic representation for our experiments. More formally, we define the graph transformation T : G_uc → G_uc by the following rule: given two nodes p, p′ ∈ P that are neighbours, {p, p′} ∈ E, and that have the same non-empty label l(p) = l(p′) ≠ empty, perform the following transformation: (1) P ↦ P \ {p′} to melt the node p′ into p. (2) E ↦ (E \ {{p′′, p′} ∈ E}) ∪ {{p′′, p} : {p′′, p′} ∈ E} to connect the remaining node p to those nodes p′′ formerly connected to p′.
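As a concrete illustration, the melting transformation can be sketched in Python (our own sketch; the encoding of a graph as a label dictionary plus a set of undirected edges, and the function name, are not from the paper):

```python
def melt_to_cfg(labels, edges):
    """Collapse same-coloured neighbouring nodes into single CFG nodes.

    labels: dict node -> 'black' | 'white' | 'empty'
    edges:  set of frozensets {p, q} (vertical/horizontal neighbourhood)
    Returns (labels, edges) of the common fate graph.
    """
    labels, edges = dict(labels), set(edges)
    changed = True
    while changed:
        changed = False
        for e in list(edges):
            p, q = tuple(e)
            if p in labels and q in labels and labels[p] == labels[q] \
                    and labels[p] != 'empty':
                # rule (1): remove q; rule (2): rewire q's edges to p.
                # Using a set of edges automatically enforces that any two
                # nodes are connected by at most a single edge.
                edges.discard(e)
                for f in [f for f in edges if q in f]:
                    edges.discard(f)
                    (r,) = f - {q}
                    if r != p:
                        edges.add(frozenset({p, r}))
                del labels[q]
                changed = True
                break
    return labels, edges

# 1x3 strip: two adjacent black stones next to an empty point
labels = {0: 'black', 1: 'black', 2: 'empty'}
edges = {frozenset({0, 1}), frozenset({1, 2})}
cfg_labels, cfg_edges = melt_to_cfg(labels, edges)
# -> two nodes remain: one black chain node and the empty point
```

Running the same function on a 2×2 block of black stones collapses it to a single node with no edges, matching the idea that a whole chain shares one fate.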
Repeated application of the transformation T to G_FGR until no two neighbouring nodes have the same colour leads to the common fate graph G_CFG. The result of such a transformation is shown in Figure 1 (b). Clearly, the complexity of the representation has been greatly reduced while retaining essential structural information in the representation.

In how far is the CFG a suitable representation for learning in the game of Go? Go players' intuition is often driven by a certain kind of aesthetics that refers to the local structure of positions and is called good or bad shape. As an example consider the two white groups in Figure 1 (a) and (b). Although they look quite distinct to the layman in Figure 1 (a), they share the property of being alive, because they both have two eyes (i.e. two isolated internal empty points called liberties).

Relative Subgraph Features
Unfortunately, almost all algorithms that aim at learning on graph representations directly suffer from severe scalability problems (see [5]). Most practically applicable learning algorithms operate on object representations known as feature vectors x ∈ R^d. We would thus like to extract feature vectors x from G_CFG for learning. Both learning tasks mentioned in the introduction can be formulated in terms of mappings from single points to real values. In both cases we would like to find a mapping x : G_uc × P → R^d that maps a particular node p ∈ P to a feature vector x ∈ R^d given the context provided by the graph G = (P, E) ∈ G_uc, an idea inspired by [5], who apply context dependent classification to the mutagenicity of chemical compounds. We enumerate d possible connected subgraphs G̃_i = (P̃_i, Ẽ_i) ∈ G_uc, i = {1, ..., d} of G such that p ∈ P̃_i. The relative subgraph feature x_i = x_i(p) is then taken proportional to the number of times n_i the subgraph G̃_i can be found in G and normalised to ‖x‖ = 1. Clearly, finding and counting subgraphs G̃_i of G becomes quickly infeasible with increasing subgraph complexity as measured, e.g., by |P̃_i| and |Ẽ_i|. We therefore restrict ourselves to connected subgraphs G̃_i with the shape of chains without branches or loops. In practice, we limit the features to a local context |P̃_i| ≤ s, which given the other two constraints also limits the number d of distinguishable features.
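A minimal sketch of this chain-feature extraction (our own illustration, not the authors' code; we enumerate label sequences of chains starting at p by depth-first walks without revisiting nodes, and count their occurrences):

```python
from collections import Counter

def chain_features(labels, edges, p, s=3):
    """Count chain subgraphs (no branches or loops) of length <= s starting at p.

    labels: dict node -> label; edges: set of frozensets {a, b}.
    Returns a Counter mapping label-sequence tuples to occurrence counts n_i;
    normalising the resulting count vector would give the feature vector x.
    """
    adj = {q: set() for q in labels}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    counts = Counter()

    def walk(node, visited, seq):
        seq = seq + (labels[node],)   # extend the chain's label sequence
        counts[seq] += 1
        if len(seq) < s:
            for nxt in adj[node]:
                if nxt not in visited:            # no loops
                    walk(nxt, visited | {node}, seq)

    walk(p, frozenset(), ())
    return counts

# CFG of a black chain node, an empty point, and a white chain node in a row
labels = {0: 'black', 1: 'empty', 2: 'white'}
edges = {frozenset({0, 1}), frozenset({1, 2})}
features = chain_features(labels, edges, p=0, s=3)
# -> one chain each of ('black',), ('black','empty'), ('black','empty','white')
```

The index i of a feature is thus identified with a particular label sequence, which keeps the enumeration implicit and cheap compared to general subgraph matching.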
3 Experimental Results
Learning Algorithms

In our experiments we focused on learning the distinction between good and bad moves. However, we are really interested in the real-valued output of the classifier, which enables us to order and select candidate moves according to their quality. We chose the class of binary kernel classifiers for learning. The predictions of these classifiers are given by

    ŷ(x) = sign(f(x)) = sign( Σ_{i=1}^m α_i k(x_i, x) ).

The α_i are the adjustable parameters and k : R^d × R^d → R is the kernel function. These classifiers have recently gained a lot of popularity due to the success of the support vector machine (SVM) [8]. We decided to use an RBF kernel with a diagonal penalty term of the form

    k(x, x′) = exp( −‖x − x′‖² / σ² ) + λ I_{x=x′},

where σ² controls the width of the kernel, and the second term makes the kernel matrix more well-conditioned. Two methods of learning the expansion coefficients α_i were employed: (1) a simple kernel perceptron [1]; (2) a soft margin support vector machine³ [3].

Life and Death

An interesting challenge in the game of Go, referred to as tsume Go, is to kill opponent groups and to save one's own groups from

³ Publicly available via http://www.kernelmachines.org/
Table 1. Results of the selection of tsume Go moves by an SVM (left) and a kernel perceptron (right). Shown is the percentage of finding a problem-solving move. The last column indicates the average percentage of moves that solve the problem.

    test \ train   white       black       w&b         #probs  #moves  %good
    white          65.8/65.3   48.0/48.8   63.8/62.6   1455    10.4    13.4
    black          57.7/57.3   66.4/64.5   65.9/65.9   1711     9.8    23.0
    w&b            61.5/61.0   58.0/57.3   64.9/64.4   3166    10.0    18.0
    #pts           2718        2682        5400
being killed. Our tsume Go study was based on a database of 40000 computer-generated Go problems with solutions by Thomas Wolf [9]. In order to calibrate the parameters of both the representation and the learning schemes, we created a simple task: the training set consisted of m = 2700 moves, where we took the best move provided by Wolf's problem solver GoTools [9] to be good and the worst one to be bad. The length s of the subgraph chains was chosen to be s = 6, resulting in d ≈ 400 features. Using a held-out data set of size m_heldout ≈ 3000, we systematically scanned parameter space. For the SVM we set λ = 0 (relying on C for regularisation) and found the optimal parameter values to be σ = 1 and C = 10. For the kernel perceptron we used the same value of σ and found λ = 0.05 to be optimal. Both solutions had a sparsity level ‖α‖₀/m of approximately 50% and led to a success rate of up to 85% for discriminating between the best and the worst move in a given problem.

In a similar setting, we applied the resulting classifier to all the possible moves in a problem and used the real-valued output f(x) for each legal move x for ranking. We counted a success when the move ranked highest by the classifier was indeed one of the winner moves. The results are given in Table 1. Focusing on the full training and test sets (w&b), the success rate is more than 3 times that of random guessing. Although we were not able to obtain the database used in [6] due to copyright problems, our result of a 65% success rate compares favourably with the 50% reported in [6] on similar problems using a naive local context representation.
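As an illustration of this learning machinery, here is a minimal kernel perceptron with an RBF-plus-diagonal-penalty kernel of the kind described above (our own sketch on a toy problem, not the authors' implementation; the parameter values are illustrative):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, lam=0.0):
    """k(x, x') = exp(-||x - x'||^2 / sigma^2) + lam * I[x == x']."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2) + lam * (d2 == 0)

def kernel_perceptron(X, y, sigma=1.0, lam=0.05, epochs=50):
    """Learn expansion coefficients alpha_i of f(x) = sum_i alpha_i k(x_i, x)."""
    K = rbf_kernel(X, X, sigma, lam)
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        mistakes = 0
        for i in range(len(X)):
            if y[i] * (K[i] @ alpha) <= 0:   # misclassified (or on the boundary)
                alpha[i] += y[i]
                mistakes += 1
        if mistakes == 0:                    # converged on the training set
            break
    return alpha

# XOR-like toy problem: not linearly separable, but easy with an RBF kernel
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1., 1., -1., -1.])
alpha = kernel_perceptron(X, y)
preds = np.sign(rbf_kernel(X, X, 1.0, 0.05) @ alpha)  # equals y after training
```

The real-valued scores K @ alpha, rather than their signs, are what would be used to rank candidate moves.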
Game Play

We applied the same learning paradigm as used above, but this time to a collection of high quality (Shodan and better) 9×9 Go game records collected from the Internet Go Server (IGS) and preprocessed and archived by Nicol Schraudolph. For each of the m ≈ 2500 moves played we randomly generated an arbitrary counterpart as a bad move. We provide two samples of the SVM's play against GnuGo⁴ in Figure 2. Connoisseurs of the game will appreciate that the machine plays amazingly coherently considering that it takes into account only local shape and has no concept of territory and urgency.

⁴ Publicly available via http://www.gnu.org/software/gnugo/
[Figure 2 board diagrams (a) and (b): the move symbols are not recoverable from the extraction.]

Figure 2. Two examples of games played by an SVM (Black) against GnuGo (White). The SVM was trained on 9×9 game records from good amateur games.
Conclusion and Acknowledgements
We presented Go as a successful application of learning based on feature vectors that are extracted from a graph representation. While the approach appears to be particularly suitable for the game of Go, it seems that other domains could benefit from these ideas as well. We would like to thank Nici Schraudolph and Thomas Wolf for providing datasets. We also thank the members of the computer-go mailing list for invaluable comments and Klaus Obermayer for hosting our project.
References
1. M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
2. J. Burmeister and J. Wiles. The challenge of Go as a domain for AI research: A comparison between Go and chess. In Proceedings of the 3rd Australian and New Zealand Conference on Intelligent Information Systems, 1994.
3. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
4. D. Fotland. Knowledge representation in The Many Faces of Go, 1993.
5. P. Geibel and F. Wysotzki. Learning relational concepts with decision trees. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 1141–1144. Morgan Kaufmann Publishers, 1998.
6. N. Sasaki and Y. Sawada. Neural networks for tsume-Go problems. In Proceedings of the Fifth International Conference on Neural Information Processing, pages 1141–1144, 1998.
7. N.N. Schraudolph, P. Dayan, and T.J. Sejnowski. Temporal difference learning of position evaluation in the game of Go. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 817–824. Morgan Kaufmann Publishers, Inc., 1994.
8. V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
9. T. Wolf. The program GoTools and its computer-generated Tsume Go database. In Proceedings of the 1st Game Programming Workshop, 1994.
Nonlinear Feature Extraction Using Generalized Canonical Correlation Analysis

Thomas Melzer, Michael Reiter, and Horst Bischof

The authors are with the Pattern Recognition and Image Processing Group, Vienna University of Technology, Vienna, Austria
{melzer,rei,bis}@prip.tuwien.ac.at
Abstract. This paper introduces a new nonlinear feature extraction technique based on Canonical Correlation Analysis (CCA) with applications in regression and object recognition. The nonlinear transformation of the input data is performed using kernel methods. Although, in this respect, our approach is similar to other generalized linear methods like kernel PCA, our method is especially well suited for relating two sets of measurements. The benefits of our method compared to standard feature extraction methods based on PCA will be illustrated with several experiments from the field of object recognition and pose estimation.
1 Introduction
When dealing with high-dimensional observations, linear mappings are often used to reduce the dimensionality of the data by extracting a small (compared to the superficial dimensionality of the data) number of linear features, thus alleviating subsequent computations. A prominent example of a linear feature extractor is Principal Component Analysis (PCA [4]). Among all linear, orthonormal transformations, PCA is optimal in the sense that it minimizes, in the mean square sense, the reconstruction error between the original signal x and the signal x̂ reconstructed from its low-dimensional representation f(x). During recent years, PCA has been especially popular in the object recognition community, where it has successfully been employed in various applications such as face recognition [13], illumination planning [9], visual inspection and even visual servoing [11]. Although this demonstrates the broad applicability of PCA, one has to bear in mind that the goal of PCA is minimization of the reconstruction error; in particular, PCA features are not well suited for regression tasks. Consider a mapping φ : x → y. There is no reason to believe that the features extracted by PCA on the variable x will reflect the functional relation between x and y in any way; even worse, it is possible that information vital to establishing this relation is discarded when projecting the original data onto the PCA feature space.
This work was supported by the Austrian Science Foundation (FWF) under grant no. P13981-INF.
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 353–360, 2001. © Springer-Verlag Berlin Heidelberg 2001
354
Thomas Melzer, Michael Reiter, and Horst Bischof
There exist, however, several other linear methods that are better suited for regression tasks, for example Partial Least Squares (PLS [3]), Multivariate Linear Regression (MLR, also referred to as Reduced Rank Wiener Filtering, see for example [2]) and Canonical Correlation Analysis (CCA [5]). Among these three, only MLR gives a direct solution to the linear regression problem. PLS and CCA will find pairs of directions that yield maximum covariance resp. maximum correlation between the two random variables x, y; regression can then be performed on these features. CCA, in particular, has some very attractive properties (for example, it is invariant w.r.t. affine transformations, and thus scaling, of the input variables) and can not only be used for regression purposes, but whenever we need to establish a relation between two sets of measurements (e.g., finding corresponding points in stereo images [1]). As an example for CCA, consider constructing a parametric manifold for pose estimation [10]. Fig. 1(a) shows two extreme views of an object, which was acquired with two varying pose parameters (pan and tilt). Let X denote the set of training images and Y the set of corresponding pose parameters. The visualization of the manifold given in Fig. 1(b) is obtained by plotting the projections of the training set onto the first three eigenvectors obtained by standard PCA, whereby neighboring (w.r.t. the pose parameters) projections are connected.
Fig. 1. Extreme views of training set (a) and the parametric manifold obtained with PCA (b) and CCA (c).
The parametric manifold serves as a starting point for computing pose estimates for new input images. The standard approach for retrieving these estimates is to resample the manifold using, e.g., bicubic spline interpolation and then to perform a nearest neighbor search for each new image [10]. Fig. 1(c) shows the manifold obtained by projecting the training images onto the first two directions found by computing CCA on X and Y (the number of factors obtained by CCA is limited by the dimensionality of the lower-dimensional set). In contrast to the PCA manifold, the CCA factors span a perfect grid; one could also say that projections of the training images onto the two linear features found by CCA are topologically ordered w.r.t. their associated pose parameters. It is obvious that pose estimation on the manifold obtained by CCA is much easier than on the PCA manifold.
Recently, we proposed a nonlinear extension of CCA by the use of kernel methods [7]. Kernel methods have become increasingly popular during the last few years, and have already been applied to PCA [12] and the Fisher Discriminant [8]. In our derivation of kernel CCA we have used the fact that the solutions (principal directions) of CCA can be obtained as the extremum points of an appropriately chosen Rayleigh Quotient (this is also true for the other linear techniques discussed thus far, see [1]). In this paper we will demonstrate the benefits of kernel CCA with an application in the field of appearance-based pose estimation. To this end we will compare the performance of features obtained by PCA, standard CCA and kernel CCA. The rest of this paper is organized as follows: in section 2, we will give a brief introduction to "classical" CCA and show how kernel methods can be applied to CCA. In section 3 we will apply kernel CCA to pose estimation. Conclusions will be given in section 4.
2 Canonical Correlation Analysis (CCA)

2.1 What Is CCA?
Given two zero-mean random variables x ∈ IR^p and y ∈ IR^q, CCA finds pairs of directions w_x and w_y that maximize the correlation between the projections x = w_x^T x and y = w_y^T y (in the context of CCA, the projections x and y are also referred to as canonical variates). More formally, CCA maximizes the function

    ρ = E[xy] / √( E[x²] E[y²] ) = E[w_x^T x y^T w_y] / √( E[w_x^T x x^T w_x] E[w_y^T y y^T w_y] ),    (1)

which, written in terms of the covariance matrices, becomes

    ρ = w_x^T C_xy w_y / √( (w_x^T C_xx w_x)(w_y^T C_yy w_y) ).    (2)
Let

    A = [[0, C_xy], [C_yx, 0]],   B = [[C_xx, 0], [0, C_yy]].    (3)
It can be shown [1] that the stationary points w∗ = (w_x∗^T, w_y∗^T)^T of ρ (i.e., the points satisfying ∇ρ(w∗) = 0) coincide with the stationary points of the Rayleigh Quotient

    r = (w^T A w) / (w^T B w),    (4)

and thus, by virtue of the Generalized Spectral Theorem [2], can be obtained as solutions (i.e., eigenvectors) of the corresponding generalized eigenproblem

    A w = μ B w.    (5)
The extremum values ρ(w∗ ), which are referred to as canonical correlations, are equally obtained as the corresponding extremum values of Eq. 4 or the eigenvalues of Eq. 5, respectively, i.e., ρ(w∗ ) = r(w∗ ) = µ(w∗ ).
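For concreteness, the generalized eigenproblem of Eq. 5 can be solved numerically, e.g. with SciPy (a minimal sketch with our own variable names; the small ridge term `reg`, which keeps B positive definite, is our own addition and is not part of the derivation):

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, reg=1e-8):
    """Linear CCA for zero-mean data matrices X (p x n) and Y (q x n).

    Solves A w = mu B w with A = [[0, Cxy], [Cyx, 0]] and
    B = [[Cxx, 0], [0, Cyy]]; the eigenvalues mu are the canonical
    correlations, returned largest first.
    """
    p, q, n = X.shape[0], Y.shape[0], X.shape[1]
    Cxx = X @ X.T / n + reg * np.eye(p)
    Cyy = Y @ Y.T / n + reg * np.eye(q)
    Cxy = X @ Y.T / n
    A = np.block([[np.zeros((p, p)), Cxy], [Cxy.T, np.zeros((q, q))]])
    B = np.block([[Cxx, np.zeros((p, q))], [np.zeros((q, p)), Cyy]])
    mu, W = eigh(A, B)                # generalized symmetric eigenproblem
    order = np.argsort(mu)[::-1]      # sort by correlation, largest first
    return mu[order], W[:p, order], W[p:, order]

# toy check: both views share a latent signal z, so the top correlation
# should be close to 1
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.vstack([z, rng.standard_normal(500)])
Y = (z + 0.05 * rng.standard_normal(500))[None, :]
X = X - X.mean(axis=1, keepdims=True)
Y = Y - Y.mean(axis=1, keepdims=True)
mu, wx, wy = cca(X, Y)
```

Eigenvalues come in ± pairs; the positive ones are the canonical correlations ρ(w∗) = μ(w∗).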
2.2 Kernel CCA
In this section we will briefly summarize the formulation of kernel CCA, which can be used to find nonlinear dependencies between two sets of observations. A detailed derivation of the algorithm can be found in [6]. Given n pairs of mean-normalized observations (x_i^T, y_i^T)^T ∈ IR^{p+q}, and data matrices X = (x_1 .. x_n) ∈ IR^{p×n}, Y = (y_1 .. y_n) ∈ IR^{q×n}, we obtain the estimates for the covariance matrices A, B in Eq. 3 as

    Â = (1/n) [[0, X Y^T], [Y X^T, 0]],   B̂ = (1/n) [[X X^T, 0], [0, Y Y^T]].    (6)
If the mean was estimated from the data, we have to replace n by n − 1 in both equations. As we know from section 2.1, computing the CCA between the data sets X, Y amounts to determining the extremum points of the Rayleigh Quotient (see Eq. 4). It can be shown [6] that for all solutions w∗ = (w_x∗^T, w_y∗^T)^T of Eq. 5, the component vectors w_x∗, w_y∗ lie in the span of the training data (i.e., w_x∗ ∈ span(X) and w_y∗ ∈ span(Y)). Under this assumption, for each eigenvector w∗ = (w_x∗^T, w_y∗^T)^T solving Eq. 5 there exist vectors f, g ∈ IR^n such that w_x∗ = Xf and w_y∗ = Yg. Thus, CCA can completely be expressed in terms of dot products. This allows us to reformulate the Rayleigh Quotient using only inner products:

    r = (f; g)^T [[0, KL], [LK, 0]] (f; g) / (f; g)^T [[K², 0], [0, L²]] (f; g),    (7)

where K, L are Gram matrices defined by K_ij = x_i^T x_j and L_ij = y_i^T y_j, K, L ∈ IR^{n×n}. The new formulation makes it possible to compute CCA on nonlinearly mapped data without actually having to compute the mapping itself. This can be done by substituting the Gram matrices K, L by kernel matrices K_φ,ij = φ(x_i)^T φ(x_j) = k_φ(x_i, x_j) and L_θ,ij = θ(y_i)^T θ(y_j) = k_θ(y_i, y_j). Here k_φ(·,·), k_θ(·,·) are the kernel functions corresponding to the nonlinear mappings φ : IR^p → IR^{m1} resp. θ : IR^q → IR^{m2}. The projections onto w_φ∗ can be computed using only the kernel function, without having to evaluate φ itself:

    φ(x)^T w_φ∗ = Σ_{i=1}^n f_φ,i φ(x)^T φ(x_i) = Σ_{i=1}^n f_φ,i k(x, x_i).    (8)
The projections onto w_θ∗ are obtained analogously. Note that using kernel CCA, we can compute more than min(p, q) (p, q being the dimensionality of the variable x and y, respectively) factor pairs, which is the limit imposed by classical CCA.
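A corresponding kernel CCA sketch on precomputed kernel matrices (ours, not the authors' code; note that the ridge `kappa` added to K² and L² is our own regularization choice, since without some regularization the dual problem is degenerate and every correlation trivially reaches 1):

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca(K, L, kappa=0.1):
    """Kernel CCA on precomputed n x n kernel (Gram) matrices K and L.

    Solves the dual Rayleigh Quotient of Eq. 7 as a generalized
    eigenproblem in the coefficient vectors (f, g).
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centring = mean-normalization
    K, L = H @ K @ H, H @ L @ H
    Z = np.zeros((n, n))
    A = np.block([[Z, K @ L], [L @ K, Z]])
    B = np.block([[K @ K + kappa * np.eye(n), Z],
                  [Z, L @ L + kappa * np.eye(n)]])
    rho, W = eigh(A, B)
    order = np.argsort(rho)[::-1]
    return rho[order], W[:n, order], W[n:, order]  # correlations, f, g

# sanity check: identical data on both sides, linear kernel
rng = np.random.default_rng(1)
Xd = rng.standard_normal((3, 60))
K = Xd.T @ Xd                  # any valid kernel matrix works here
rho, f, g = kernel_cca(K, K.copy())
# the leading regularized correlation is close to (but below) 1
```

Projections of a new point are then obtained via Eq. 8 from f (resp. g) and the kernel evaluations against the training points.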
3 Experiments
In the following example we apply CCA to a pose estimation problem, where we relate images of objects at varying pose to corresponding pose parameters. Experiments were conducted on three test objects; detailed quantitative figures are given only for object 2(a). However, results for the remaining objects are similar.
Fig. 2. (a) Image x_1 of one of our test objects. (b) Canonical factor w_x∗ computed on X_8. (c),(d) Canonical factors obtained on the same set using a nonlinear (trigonometric) representation of the output parameter space. (e),(f) 2 factors obtained by kernel CCA with canonical correlations ρ = 1.0.
Let X = {x_i : 1 ≤ i ≤ 180} denote the set of images and Y = {y_i : y_i ∈ [0, 358]} the corresponding pose parameters (horizontal orientation of the object w.r.t. the camera in degrees). The images are represented as 128²-dimensional vectors that are obtained by sequentially stacking the image columns. Each image set X was subsampled to obtain a subset of images that was used as a training set. The remaining images were assigned to the corresponding test set. Let X_k denote a training set that was generated by subsampling X at every k-th position¹. Since y_i is a scalar, standard CCA yields only one feature vector (canonical factor) w_x∗ (see figure 2(b)). Figure 4 (in the upper 2 plots) gives a quantitative comparison of the pose estimation error (linear regression model) using PCA and standard CCA. The pose estimation error was calculated on the test set as the absolute difference between known and estimated parameter value (orientation in degrees). Figure 3(a) shows a plot of pose estimates for the complete set of images X obtained using CCA. The dotted line indicates the true pose parameter values y_i. Pose parameter values of X_k are marked by filled circles. In figure 3 a linear least-square mapping was used to map the projections of the training set to pose parameters. Note that the pose estimation error grows rapidly around i = 180. This problem is due to the fact that the scalar representation for y_i has a discontinuity at y_i = 360. From figure 2(a) it can be seen that the main part of the information held in w_x∗ is about the transition of image x_180 to image x_1. For this reason we chose a periodic, trigonometric representation of the pose parameters, y_i = [sin(y_i), cos(y_i)]^T. The object's pose is now characterized by 2

¹ X_k contains all images x_{ik} with ik ∈ {1, k, 2k, ..., 180/k}. Since the original set shows the object in 2 degree steps, X_k shows the object at 2k degree steps.
parameters, and thus CCA yields 2 canonical factors (see figure 2 (c) and (d)) in the image vector space. Estimates obtained for these intermediate output values (again using linear regression) are given in figure 3(b), while the final pose estimates (obtained by combining these two estimates using atan2) are given in figure 3(c). Thus, by using a priori knowledge about the problem domain, a significant increase in pose estimation accuracy could be obtained.
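The final recombination step is a one-liner (a trivial sketch; NumPy's `arctan2` is the four-quadrant arc tangent used for figure 3(c)):

```python
import numpy as np

def pose_from_trig(sin_est, cos_est):
    """Combine estimated sin/cos components into an angle in [0, 360)."""
    return np.degrees(np.arctan2(sin_est, cos_est)) % 360.0

# noisy estimates of a 350-degree pose: the scalar representation suffers
# from the wrap-around near 360 degrees, the trigonometric one does not
angle = pose_from_trig(np.sin(np.radians(350.0)) + 0.01,
                       np.cos(np.radians(350.0)))
```

Because sin and cos are both continuous in the angle, small estimation errors near the wrap-around point no longer produce near-360-degree pose errors.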
Fig. 3. Output parameter estimates obtained from feature projections by the linear regression function of the training set. Horizontal axes correspond to the image indices, vertical axes to estimated output parameter values. The dotted line indicates the true pose parameter values. Parameter values of the training set are marked by filled circles. (a) shows estimates using the scalar representation of orientation, (b) using the trigonometric representation. (c) Estimated pose obtained from the trigonometric representation using the four-quadrant arc tangent. Note that the accuracy of the estimated pose parameters can be improved considerably. (d) Projections onto factors obtained using kernel CCA: optimal factors can be obtained automatically.
The last experiment shows the relative performance of CCA, kernel-CCA and PCA when using spline interpolation and resampling [10] for pose estimation. For standard CCA we used the hard-coded trigonometric pose representation (yielding 2 factor pairs), while kernel-CCA (using an RBF kernel with σ = 1.6) obtained similar factors automatically from the original scalar pose representation (see figures 2 (e)-(f) and 3(d)). Results are given in figure 4 (middle row). The lowermost two plots of figure 4 show results for PCA. The pose estimation error is significantly larger compared to CCA when using the same number of features.
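For reference, the linear CCA underlying this comparison can be sketched in a few lines. The whitening-plus-SVD route and the small ridge term `reg` below are implementation choices of ours for numerical stability, not details taken from the paper:

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """First pair of canonical factors (w_x, w_y) for data matrices
    X (n x p) and Y (n x q), via whitening and SVD. `reg` is a small
    ridge term (an assumption, for numerical stability)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx = Xc.T @ Xc / len(X) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / len(X)
    # Whiten both views; the leading singular pair of the whitened
    # cross-covariance gives the maximally correlated directions.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, 0], Wy @ Vt[0], s[0]

rng = np.random.default_rng(0)
z = rng.normal(size=200)                   # shared latent signal
X = np.c_[z, rng.normal(size=200)]         # view 1: signal + noise dimension
Y = np.c_[rng.normal(size=200), 0.5 * z]   # view 2: noise + scaled signal
wx, wy, corr = cca(X, Y)
print(corr > 0.9)   # the shared dimension is recovered with high correlation
```

On toy data with one shared latent dimension, the first canonical correlation is close to 1, illustrating why CCA factors target exactly the directions relevant for regression onto the output view.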
4 Conclusion and Outlook
Although little known in the field of pattern recognition and signal processing, CCA is a very powerful and versatile statistical tool that is especially well suited for relating two sets of measurements. CCA, like PCA, can also be regarded as a linear feature extractor. CCA features are, however, much better suited for regression tasks than features obtained by PCA; this was demonstrated in section 3 for the task of computing a parametric object manifold for pose estimation.
Nonlinear Feature Extraction Using Generalized CCA
[Figure 4 plot area; panels: "penguin: PCA, 10 features, lin. regr."; "penguin: CCA, 1 feature, lin. regr."; "penguin: CCA, 2 trigonometric features, spline int/res"; "penguin: Kernel CCA, 2 features, spline int/res"; "penguin: PCA, 2 features, spline int/res"; "penguin: PCA, 10 features, spline int/res". Horizontal axes: sampling interval; vertical axes: absolute error (degrees).]
Fig. 4. Uppermost two plots: Comparison of the pose estimation error when using a linear regression from factor projections to pose parameters. The left plot shows the errors for a 10-dimensional PCA feature space. The right plot shows the results when using only one CCA feature. Image sets have been subsampled at different intervals (horizontal axis) to obtain increasingly smaller training sets. Subsampling was done at 9 different sampling intervals (k = 2, ..., 10). Note that the plots have been scaled differently. Plots in the middle row: Pose estimation error for CCA features. The left column shows results for "classical" CCA when using a 2-dimensional trigonometric representation of the output parameters. The right column shows results when using 2 features obtained by kernel-CCA and a scalar pose representation. Lower two plots: Pose estimation error using the first 10 eigenvectors (left plot) and only 2 eigenvectors (right plot). Estimates were obtained from feature projections using spline interpolation and resampling.
In section 2, we discussed how to nonlinearly extend CCA by using kernel functions. Kernel-CCA is an efficient nonlinear feature extractor, which also overcomes some of the limitations of classical CCA. Finally, in section 3, we applied kernel-CCA to an object pose estimation problem. There, it was also shown that kernel-CCA will automatically find an optimal, periodic representation for a training set containing object views ranging from 0 to 360 degrees (i.e., for periodic data). Currently, we are investigating methods for obtaining the desired pose parameters (or, in general, output estimates) from the projections onto the transformed output basis vectors w_φ, i.e., for inverting a kernelized output representation.
References
1. Magnus Borga. Learning Multidimensional Signal Processing. Linköping Studies in Science and Technology, Dissertations, No. 531. Department of Electrical Engineering, Linköping University, Linköping, Sweden, 1998.
2. Konstantinos I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. John Wiley & Sons, 1996.
3. A. Höskuldsson. PLS regression methods. Journal of Chemometrics, 2:211–228, 1988.
4. H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:498–520, 1933.
5. H. Hotelling. Relations between two sets of variates. Biometrika, 8:321–377, 1936.
6. Thomas Melzer, Michael Reiter, and Horst Bischof. Kernel CCA: A nonlinear extension of canonical correlation analysis. Submitted to IEEE Trans. Neural Networks, 2001.
7. Thomas Melzer and Michael Reiter. Pose estimation using parametric stereo eigenspaces. In Tomas Svoboda, editor, Czech Pattern Recognition Workshop, pages 77–80. Czech Pattern Recognition Society, Praha, 2000.
8. S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Y.H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing, volume 9, pages 41–48. IEEE, 1999.
9. Hiroshi Murase and Shree K. Nayar. Illumination planning for object recognition using parametric eigenspaces. IEEE Trans. Pattern Analysis and Machine Intelligence, 16(12):1219–1227, December 1994.
10. Hiroshi Murase and Shree K. Nayar. Visual learning and recognition of 3-d objects from appearance. International Journal of Computer Vision, 14(1):5–24, January 1995.
11. Shree K. Nayar, Sameer A. Nene, and Hiroshi Murase. Subspace methods for robot vision. IEEE Trans. Robotics and Automation, 12(5):750–758, October 1996.
12. Bernhard Schölkopf, Alex Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
13. Matthew Turk and Alexander P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
Gaussian Process Approach to Stochastic Spiking Neurons with Reset

Kenichi Amemori¹,² and Shin Ishii¹,²

¹ Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0101, Japan
² CREST, Japan Science and Technology Corporation
[email protected]

Abstract. This article theoretically examines the behavior of spiking neurons whose input spikes obey an inhomogeneous Poisson process. Since the probability density of the membrane potential converges to a Gaussian distribution, the stochastic process becomes a Gaussian process. With a frequently used spike response function, the process becomes a multiple-Markov Gaussian process. We develop a method which can precisely calculate the dynamics of the membrane potential and the firing probability. The effect of reset after firing is also considered. We find that the synaptic time constant of the spike response function, which has often been ignored in existing stochastic process studies, has a significant influence on the firing probability.
1 Introduction
The stochastic nature of spikes has been reported in many areas of the cortex. Spike variability that seemingly obeys a Poisson process is observed not only at a low spontaneous firing rate but also at a high firing rate [7]. In spite of this variability, stimulus-induced spike modulation can be observed in temporal patterns of spikes averaged over trials (e.g., the post-stimulus time histogram). For instance, MT neurons of a behaving monkey exhibit precisely modulated spikes, and the modulation is almost invariant over trials [3]. In the visual cortex, the neuronal responses vary a lot even when they are evoked by the same stimulus. However, by adding the response averaged over trials to the initial neuron state, the actual single response can be well estimated [2]. Accordingly, spikes are non-stationary stochastic events, and the temporal behaviors can be extracted from the trial average of spikes. Here we assume that the spikes are approximately independent of each other, and that the spike frequency may change over time. Such stochastic nature is theoretically modeled by an inhomogeneous Poisson process. In this article, we theoretically examine the behavior of spiking neurons whose input spikes obey an inhomogeneous Poisson process. First, we develop a Gaussian process approach in order to precisely calculate the dynamics of the membrane potential and the firing probability. Our forward-type calculation method is able to consider the effect of reset, which has not been dealt with in the existing backward-type calculation method [4]. Second, we find that the synaptic time constant of the spike response function, which has often been ignored in existing stochastic process studies [8], has a significant influence on the firing probability.
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 361–368, 2001. © Springer-Verlag Berlin Heidelberg 2001
2 Stochastic Spiking Neuron Model
Spiking Neurons. Spiking neuron models assume that neuronal activities are described based on the spikes that are input to or emitted by the neuron. The spike response model [5] is one of the spiking neuron models. In that model, the membrane potential of a neuron is defined by the weighted summation of postsynaptic potentials, each of which is induced by a single input spike. The membrane potential is given by

v(t) = \sum_{j=1}^{N} w_j \sum_{f=1}^{n_j(t)} u(t - t_j^f).
Here, N is the number of synapses projecting to the neuron, w_j is the transmission efficacy of synapse j, n_j(t) is the number of spikes that enter through synapse j until time t, and t_j^f is the time at which the f-th spike enters through synapse j. u(s) is called the spike response function. In this article, we assume that the spike response function has the following form:

u(s) = \frac{\tau_m}{\tau_m - \tau_s} \left[ \exp\left(-\frac{s}{\tau_m}\right) - \exp\left(-\frac{s}{\tau_s}\right) \right] H(s) = \frac{\beta}{\beta - \alpha} \left( e^{-\alpha s} - e^{-\beta s} \right) H(s),   (1)

where the step function H(s) is defined by H(s) = 1 (s ≥ 0) and H(s) = 0 (s < 0). τ_m = 1/α is called the membrane time constant, and τ_s = 1/β is called the synaptic time constant. When β → α, function (1) approaches

u(s) = \frac{s}{\tau_m} \exp\left(-\frac{s}{\tau_m}\right) H(s) = \alpha s\, e^{-\alpha s} H(s),   (2)

which is called the alpha-function. When β → ∞, function (1) approaches

u(s) = \exp\left(-\frac{s}{\tau_m}\right) H(s) = e^{-\alpha s} H(s).   (3)
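The three response functions can be written down directly. In the sketch below, the values α = 0.1 and β = 0.3 (per ms) follow the figures in the text, and the numerical check of the normalization ∫u = 1/α is our own addition:

```python
import math

def u_double(s, alpha=0.1, beta=0.3):
    """Response function (1): difference of exponentials, gated by H(s)."""
    if s < 0:
        return 0.0
    return beta / (beta - alpha) * (math.exp(-alpha * s) - math.exp(-beta * s))

def u_alpha(s, alpha=0.1):
    """Limit beta -> alpha, the alpha-function (2)."""
    return alpha * s * math.exp(-alpha * s) if s >= 0 else 0.0

def u_exp(s, alpha=0.1):
    """Limit beta -> infinity, a pure exponential decay (3)."""
    return math.exp(-alpha * s) if s >= 0 else 0.0

# All three functions integrate to 1/alpha (here 10), matching the
# normalization used in Figure 1; check (1) with a simple Riemann sum.
dt = 0.01
integral = sum(u_double(k * dt) * dt for k in range(200000))
print(round(integral, 2))  # close to 1/alpha = 10
```

Note that (1) and (2) satisfy u(0) = 0 while (3) has u(0) = 1; this difference is exactly what makes C^n singular for (1) and (2) later in the paper.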
The shapes of the response functions are shown in Figure 1 (left).

Poissonian Spikes. Based on the spike response model, we examine the stochastic properties of a single neuron when the spikes input to the neuron obey an inhomogeneous Poisson process. The input spike sequence, {t_j^f | f = 1, . . . , n_j(t)}, is described by random variables. Subsequently, random variables are denoted by boldface fonts. Given the spike response function u(s), we examine the property of the membrane potential:

v(t) = \sum_{j=1}^{N} w_j \sum_{f=1}^{n_j(t)} u(t - t_j^f) \equiv \sum_{j=1}^{N} x_j(t),

where x_j(t) is the summation of the postsynaptic potentials (PSPs) corresponding to synapse j. When the membrane potential v(t) becomes larger than the firing threshold θ,
Fig. 1. (Left) Shape of the response function u(s), where α = 0.1 [ms]. The dashed lines denote function (1), with β = 1.0, 0.3, and 0.15 [ms] from left to right. The dash-dotted line denotes function (2), and the solid line denotes function (3). All of the functions are normalized so that their integrals are 1/α. (Right) Examples of the dynamics of the membrane potential. The upper figure shows the membrane potential when an absorbing threshold is assumed. The middle figure shows the membrane potential when the reset after firing is considered. The input spikes obey an inhomogeneous Poisson process whose intensity is described in the lower figure. θ = 15 and β = 0.3.
the neuron fires and the membrane potential is reset to zero. We examine the stochastic process of an ensemble of such neurons. If the number of synapses, N, is large, the density of the membrane potential v(t) can be approximated by a Gaussian distribution [4]. When spikes are assumed to obey an inhomogeneous Poisson process, all the spikes are mutually independent. x_j(t) then becomes a summation of independent random variables, and the probability density of x_j(t) converges to a Gaussian distribution due to the central limit theorem. When the Poisson intensity is relatively large (e.g., λ(t) ≈ 3 [Hz], like a spontaneous firing rate) in comparison to the decay of the membrane potential (e.g., the membrane time constant τ_m ≈ 10 [ms]), the Gaussian approximation describes the membrane potential very well [6]. This approximation is valid even when N is not large.
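This Gaussian picture can be probed empirically against Campbell's theorem for shot noise, which gives mean = λ∫u and variance = λ∫u² for a Poisson-driven PSP sum. The homogeneous rate, window, and use of response function (3) below are illustrative choices of ours:

```python
import math
import random

random.seed(1)
alpha, lam, T = 0.1, 0.3, 200.0   # decay [1/ms], rate [spikes/ms], window [ms]

def u(s):
    """Response function (3): pure exponential decay."""
    return math.exp(-alpha * s) if s >= 0 else 0.0

def sample_potential():
    """Summed PSPs at time T from one homogeneous Poisson spike train."""
    t, v = 0.0, 0.0
    while True:
        t += random.expovariate(lam)
        if t > T:
            return v
        v += u(T - t)

vs = [sample_potential() for _ in range(5000)]
mean = sum(vs) / len(vs)
var = sum((x - mean) ** 2 for x in vs) / len(vs)
# Campbell's theorem predicts mean = lam/alpha = 3.0 and
# variance = lam/(2*alpha) = 1.5 for this kernel; agreement over many
# trials supports the Gaussian approximation of the potential density.
print(abs(mean - 3.0) < 0.1, abs(var - 1.5) < 0.2)
```

The same first two moments are exactly what the η(t) and C of the next section track in the inhomogeneous case.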
3 Gaussian Process

In order to examine the stochastic process whose density is described by a Gaussian distribution at every time, we first calculate the joint probability density of the membrane potential. In the following, the neuron behavior is assumed to be observed at every Δt time step, i.e., t_m = mΔt, where Δt is sufficiently small.

Joint Probability Density. For simplicity of expression, v^{t_m} is abbreviated as v^m and the sequence of variables {v^m, . . . , v^1} is denoted by {v}_1^m. The joint probability density is defined by

g(\{v\}_1^m) = \left\langle \delta(v^m - v(t_m)) \cdots \delta(v^1 - v(t_1)) \right\rangle = \int \frac{d\xi_1}{2\pi} \cdots \int \frac{d\xi_m}{2\pi} \exp\Big( i \sum_{k=1}^{m} \xi_k v^k \Big) \Big\langle \exp\Big( -i \sum_{k=1}^{m} \xi_k v(t_k) \Big) \Big\rangle.
By expanding the logarithm of the joint characteristic function, we obtain

\ln \Big\langle \exp\Big( -i \sum_{k=1}^{m} \xi_k v(t_k) \Big) \Big\rangle = -i \sum_{k=1}^{m} \xi_k \eta(t_k) - \frac{1}{2} \sum_{k=1}^{m} \sum_{l=1}^{m} \xi_k \xi_l C_{kl} + O(\xi_k^3),

where η(t_k) ≡ ⟨v(t_k)⟩ and C_{kl} ≡ ⟨v(t_k)v(t_l)⟩ − ⟨v(t_k)⟩⟨v(t_l)⟩. Under the Gaussian assumption, we can omit the third-order terms O(ξ_k³). Then, the joint probability density is approximated as

g(\{v\}_1^m) \approx \int_{-\infty}^{\infty} \frac{d\xi}{(2\pi)^m} \exp\Big( i \xi'(v - \eta) - \frac{1}{2} \xi' C \xi \Big),

where v = (v^{t_1}, . . . , v^{t_m})′, η = (η(t_1), . . . , η(t_m))′, ξ = (ξ_1, . . . , ξ_m)′, and C is the m-by-m autocorrelation matrix of v. C is positive definite. A prime (′) denotes a transpose. By defining ξ = x + iy, we obtain

g(\{v\}_1^m) = \int_{-\infty}^{\infty} \frac{d\xi}{(2\pi)^m} \exp\Big( -\frac{1}{2} x' C x - (v - \eta)' y + \frac{1}{2} y' C y + i x' (v - \eta - C y) \Big).

When y = C^{-1}(v − η), the integral is done only over the real part x. Then,

g(\{v\}_1^m) = \frac{1}{\sqrt{(2\pi)^m |C|}} \exp\Big( -\frac{1}{2} (v - \eta)' C^{-1} (v - \eta) \Big),   (4)
where |C| is the determinant of C. Subsequently, we analyze a Gaussian process defined by the autocorrelation matrix C.

Time Evolution of the Gaussian Process. In order to see the characteristics of the autocorrelation matrix C, its incremental (recursive) definition along the time sequence {t_1, . . . , t_m} is derived here. Let C^n ≡ (c^n_{ij}) be the autocorrelation matrix at time t_n (1 ≤ n ≤ m). Because the synaptic inputs are statistically independent of each other, C^n can be described by the weighted summation of autocorrelations of single synaptic inputs. Namely,

c^n_{ij} = w^2 \sum_{k=1}^{\min(i,j)} u(t_i - t_k)\, u(t_j - t_k)\, \lambda(t_k)\, \Delta t \qquad (1 \le i, j \le n),

where w² ≡ Σ_{j=1}^{N} w_j². Because C^n is symmetric and positive definite, C^n can be decomposed into (U^n)′U^n, where U^n = (u_{ij}) is an upper triangular matrix. The components of U^n are described as

u_{ij} = \begin{cases} w\, u(t_j - t_i) \sqrt{\lambda(t_i)\, \Delta t} & i \le j \\ 0 & i > j. \end{cases}

If response function (3) is used, C^n is regular. If response function (1) or (2) is used, however, C^n is singular, because the diagonal elements of U^n are zeros due to u(0) = 0. In order to avoid the singularity, we redefine the components of U^m by

u_{ij} = \begin{cases} w\, u(t_{j+1} - t_i) \sqrt{\lambda(t_i)\, \Delta t} & i \le j \\ 0 & i > j. \end{cases}   (5)
In both cases, the inverse of U^n, Γ^n, can be incrementally defined by

\Gamma^n = \begin{pmatrix} \Gamma^{n-1} & -\Gamma^{n-1} u^n / u_{nn} \\ 0 & 1/u_{nn} \end{pmatrix},

where u^n ≡ (u_{1n}, . . . , u_{n−1,n})′. The nth column of Γ^n is constructed by Gram-Schmidt orthogonalization. The inverse of C^n is then obtained by

(C^n)^{-1} = \Gamma^n (\Gamma^n)' \qquad (n = 1, \ldots, m).   (6)
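A small numerical check of the factorization C = U′U and the column-by-column recursion for Γ = U⁻¹ can be sketched as follows; response function (3) is used so that C is regular, and all parameter values are illustrative:

```python
import numpy as np

alpha, w, lam, dt, m = 0.1, 1.0, 0.3, 0.5, 6
t = np.arange(1, m + 1) * dt

def resp(s):
    """Response function (3): exponential decay, zero for s < 0."""
    return np.exp(-alpha * s) if s >= 0 else 0.0

# Upper-triangular factor with u_ij = w * u(t_j - t_i) * sqrt(lam * dt),
# so that C = U' U as in the text (homogeneous intensity for simplicity).
U = np.zeros((m, m))
for i in range(m):
    for j in range(i, m):
        U[i, j] = w * resp(t[j] - t[i]) * np.sqrt(lam * dt)
C = U.T @ U

# Grow Gamma = U^{-1} one column at a time, following the block recursion.
Gamma = np.zeros((m, m))
for n in range(m):
    u_n = U[:n, n]                                   # (u_1n, ..., u_{n-1,n})'
    Gamma[n, n] = 1.0 / U[n, n]
    Gamma[:n, n] = -Gamma[:n, :n] @ u_n / U[n, n]    # new column

err = np.abs(np.linalg.inv(C) - Gamma @ Gamma.T).max()
print(err < 1e-8)   # (C)^{-1} = Gamma Gamma', equation (6)
```

Each new observation time thus only appends one column to Γ, which is what makes the forward calculation of the process cheap.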
Equation (6) introduces an incremental calculation of the inverse of the autocorrelation matrix C along the discretized time sequence. When response function (3) is used, the inverse autocorrelation matrix becomes a tridiagonal matrix. For response function (1) or (2), the inverse autocorrelation matrix becomes a band-diagonal matrix whose bandwidth is five. This is because these response functions are the solutions of first- or second-order linear ordinary differential equations. This simplicity also implies that the transition probability density defined by the autocorrelation matrix is strictly determined by a multiple Markov process along the discretized time sequence. Namely, function (3) introduces a single Markov process, and function (1) or (2) introduces a double Markov process.

Transition Probability Density. The time evolution is determined by the transition probability density g(v^m | {v}_1^{m−1}) = g({v}_1^m)/g({v}_1^{m−1}). From equation (4), the transition probability density becomes

g(v^m \mid \{v\}_1^{m-1}) = \frac{1}{\sqrt{2\pi\gamma^m}} \exp\left( -\frac{\Big( v^m - \eta(t_m) - \sum_{i=1}^{m-1} \kappa_i^m \big(v^i - \eta(t_i)\big) \Big)^2}{2\gamma^m} \right),

where the conditional variance γ^m and the correlation coefficients κ_i^m are obtained from equation (6) as

\gamma^m = |C^m| / |C^{m-1}| = (u_{mm})^2, \qquad \kappa_i^m = u_{mm} (\Gamma^m)_{im}.

By defining w_0 = Σ_{j=1}^{N} w_j, we can prove (the proof is omitted) that Σ_{i=1}^{m−1} κ_i^m u_{ni} = u_{nm} and η(t_n) = (w_0/w) Σ_{i=1}^{n} u_{in} √(λ(t_i)Δt). Then, η(t_m) = Σ_{i=1}^{m−1} κ_i^m η(t_i) + (w_0/w) u_{mm} √(λ(t_m)Δt). If a d-ple Markov process is assumed, κ_i^m = 0 (i = 1, . . . , m − d). In this case, the transition probability becomes

g(v^m \mid \{v\}_1^{m-1}) = \frac{1}{\sqrt{2\pi\gamma^m}} \exp\left( -\frac{\Big( v^m - \frac{w_0}{w} u_{mm} \sqrt{\lambda(t_m)\Delta t} - \sum_{i=m-d}^{m-1} \kappa_i^m v^i \Big)^2}{2\gamma^m} \right).   (7)
This equation means that the transition probability density is described by λ(t_m) and u_{m−d,m}, . . . , u_{mm}. Namely, the transition probability is determined by the present intensity and the membrane properties over the previous d steps. This is exactly the d-ple Markov assumption itself.
Now the effect of the reset after firing is considered. If the membrane potential of a sample neuron exceeds the threshold, the potential is reset to zero. Since the threshold does not have any other effect, the transition probability at d steps after the firing can also be given by (7). Therefore, the density of the membrane potential at each time can be described using two terms, the (joint) probability densities of samples that have and that have not fired during the past d steps from the present time.

Firing Probability and Threshold Effect. The joint probability density is decomposed as

g(\{v\}_1^m) = g(v^m \mid \{v\}_1^{m-1})\, g(v^{m-1} \mid \{v\}_1^{m-2}) \cdots g(v^1).

The density at time t_m whose sample has not reached the threshold until time t_{m−1} is described as

g_0(v^m) = \int_{-\infty}^{\theta} dv^{m-1} \cdots \int_{-\infty}^{\theta} dv^1\; g(v^m \mid \{v\}_1^{m-1})\, g(v^{m-1} \mid \{v\}_1^{m-2}) \cdots g(v^1).

The firing probability at time t_m is given by ρ(t_m) = ∫_θ^∞ g_0(v^m) dv^m. If a sample neuron fires at time t_m, its potential is reset to zero; therefore, the density at potential zero increases by ρ(t_m). Under the single Markov assumption (i.e., when response function (3) is used): g(v^m | {v}_1^{m−1}) = g(v^m | v^{m−1}), the density g_0(v^m) is calculated by the very simple equations

\rho(t_m) = \int_{\theta}^{\infty} g_0(v^m)\, dv^m,
g_0(v^m) = \int_{-\infty}^{\theta} g(v^m \mid v^{m-1})\, g_0(v^{m-1})\, dv^{m-1} + \rho(t_{m-1})\, \delta(v^m),

where δ(·) is the Dirac delta function. Under the double Markov assumption (i.e., when response function (1) or (2) is used): g(v^m | {v}_1^{m−1}) = g(v^m | v^{m−1}, v^{m−2}), the density g_0(v^m) is calculated by

\rho(t_m) = \int_{\theta}^{\infty} g_0(v^m)\, dv^m,
g_0(v^m) = \int_{-\infty}^{\theta} \int_{-\infty}^{\theta} g(v^m \mid v^{m-1}, v^{m-2})\, g_0(v^{m-1}, v^{m-2})\, dv^{m-1}\, dv^{m-2},
g_0(v^m, v^{m-1}) = \int_{-\infty}^{\theta} g(v^m \mid v^{m-1}, v^{m-2})\, g_0(v^{m-1}, v^{m-2})\, dv^{m-2} + \rho(t_{m-1})\, g(v^m, v^{m-1}).

The joint density of the reset states, g(v^m, v^{m−1}), is obtained by (4). Note that it is difficult to consider the reset effect in the backward-type calculation method [4], because that method does not obtain the density when the potential is lower than the threshold.
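The single-Markov forward recursion with reset can be sketched on a discretized potential grid. The Gaussian step kernel below (decay a, drift mu, variance s2 per step) and all parameter values are illustrative stand-ins of ours, not the kernel derived in (7):

```python
import math

theta, a, mu, s2 = 3.0, 0.95, 0.2, 0.05
dv = 0.05
grid = [k * dv for k in range(-60, 61)]        # v in [-3, theta]
i0 = grid.index(0.0)                           # index of the reset value

def trans(v, vp):
    """g(v | v'): Gaussian centred at a*v' + mu with variance s2."""
    return math.exp(-(v - a * vp - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

g0 = [0.0] * len(grid)
g0[i0] = 1.0 / dv                              # all mass starts at v = 0
rhos = []
for step in range(60):
    # Propagate the sub-threshold density one step forward.
    new = [sum(trans(v, vp) * g0[j] * dv for j, vp in enumerate(grid))
           for v in grid]
    mass = sum(new) * dv                       # mass still below threshold
    rho = max(0.0, 1.0 - mass)                 # probability of firing this step
    new[i0] += rho / dv                        # reset: fired mass re-enters at 0
    rhos.append(rho)
    g0 = new
print(0.0 <= max(rhos) < 1.0, abs(sum(g0) * dv - 1.0) < 1e-6)
```

Because the fired mass is fed back into the density at zero before the next step, total probability is conserved at every step, which is exactly the bookkeeping the backward method cannot do.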
4 Experiment and Discussion
As a simplified model of the cerebral cortex, we consider a network of excitatory and inhibitory neurons. The numbers of excitatory and inhibitory inputs are set at
N_E = 6000 and N_I = 1500, respectively. We also set the synaptic efficacy values to be the same within each of the excitatory and inhibitory input groups; they are given by w_E = 0.21 and w_I = 0.63, respectively. These experimental conditions follow [1]. As an example of an inhomogeneous Poisson process, we assume that the Poisson intensity changes with time in accordance with λ(t) = λ_0 + ∆λ [sin(2π(t+τ)/τ) + sin²(2π(t+τ)/τ)] [Hz], where λ_0 = 3.0 [Hz], ∆λ = 1.0 [Hz] and τ = 0.2 [s]. The numerical calculation is done with the time step Δt = 0.5 [ms]. We set τ_m = 10 [ms]. Typical time series (sample paths) of the membrane potential with and without resets are shown in Figure 1 (right). Figure 2 shows the firing probability for various values of the synaptic time constant τ_s and the threshold θ. In subfigures (a), (b) and (c), the reset effect is considered. These theoretical results agree very well with those of Monte-Carlo simulations. It should be noted that the only approximation used in our theory is the Gaussian approximation of the membrane potential, which is a valid assumption. Figure 2 shows that the firing probability significantly depends on the synaptic time constant τ_s. This effect is prominent especially when the threshold θ is large. This change of the firing probability is mainly due to the change of the potential variance σ²(t). The mean η(t) does not change very much because the integral of the response function is fixed, while the variance σ²(t) decreases as τ_s increases. Conventionally, it has been considered that the change of the synaptic efficacy w_j is important for neuronal information processing (as in Hebb's prescription). On the other hand, our theoretical result implies another possibility for neuronal information processing, namely, the modulation of the firing probability by changing the synaptic time constant.
When the threshold is relatively low in comparison to the rate of the input spikes, the firing probability is mainly determined by the input rate. In this case, the change of the synaptic time constant is not very important. When the threshold is relatively high, on the other hand, the above-mentioned modulation is possible. In cortical areas in which the firing rate is low and coincidental spikes often occur, neurons might process their information by temporally changing the shapes of PSPs.
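For reference, spike trains from such a time-varying intensity can be drawn by thinning (Lewis-Shedler rejection). The exact functional form of λ(t) below is our reading of the experiment's modulated intensity and should be treated as an assumption:

```python
import math
import random

random.seed(0)
lam0, dlam, tau = 3.0, 1.0, 0.2               # [Hz], [Hz], [s]

def lam(t):
    """Assumed intensity: a sinusoidally modulated baseline rate."""
    ph = 2 * math.pi * (t + tau) / tau
    return lam0 + dlam * (math.sin(ph) + math.sin(ph) ** 2)

lam_max = lam0 + 2.25 * dlam                  # coarse upper bound on lam(t)
spikes, t, T = [], 0.0, 10.0
while True:
    t += random.expovariate(lam_max)          # candidate from a fast process
    if t > T:
        break
    if random.random() < lam(t) / lam_max:    # accept with probability lam/lam_max
        spikes.append(t)
print(len(spikes) > 0 and all(0 < s <= T for s in spikes))
```

Averaging many such trains recovers λ(t), which is the trial-averaged temporal structure the introduction appeals to.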
5 Conclusion
The responses of spiking neurons to input spikes obeying an inhomogeneous Poisson process were theoretically examined. We developed a Gaussian process approach in order to precisely calculate the dynamics of the membrane potential and the firing probability. Our forward-type calculation method can consider the effect of resets. We have found that the synaptic time constant of the spike response function has a significant influence on the firing probability, especially when the threshold is high.
References
1. Amit, D. & Brunel, N. (1997) Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8: 373.
2. Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996) Dynamics of ongoing activity: explanation of the large variability in evoked cortical responses. Science 273: 1868–1871.
Fig. 2. Variation of the firing probability due to variation of the synaptic time constant τ_s ≡ 1/β. (a)-(c): Firing probability density when firing and reset are considered. The threshold is 18 for (a), 15 for (b), and 12 for (c). In each figure, the solid lines are for β = ∞ (function (3)), 3.0 (function (1)), and α (function (2)), from top to bottom. (d): Means of the membrane potential when firing is not considered. τ_s is not very important for the mean. (e): Standard deviations (SDs) of the membrane potential when firing is not considered. The SDs decrease as τ_s increases. (a)-(e): The results of Monte-Carlo simulations over 160000 trials are overlaid as dashed lines in all the subfigures. The difference between the theoretical results (solid) and the Monte-Carlo results (dashed) is too small to observe in each subfigure. The maximum difference is about 0.002 for the firing probability density, and 0.02 for the mean and SD, which will decrease as the number of Monte-Carlo trials increases. (f): The input Poisson intensity.
3. Bair, W. & Koch, C. (1996) Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Computation 8: 1185–1202.
4. Burkitt, A.N. & Clark, G.M. (1999) Analysis of the synchronization of synaptic input and spike output in neural systems. Neural Computation 11: 871–901.
5. Gerstner, W. (1998) Spiking neurons. In Pulsed Neural Networks, pp. 3–49. MIT Press.
6. Papoulis, A. (1984) Probability, Random Variables, and Stochastic Processes. McGraw-Hill.
7. Shadlen, M.N. & Newsome, W.T. (1998) The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. Journal of Neuroscience 18: 3870–3896.
8. Tuckwell, H.C. (1988) Introduction to Theoretical Neurobiology, volume 2: Nonlinear and Stochastic Theories. Cambridge University Press.
Kernel Based Image Classification

Olivier Teytaud¹ and David Sarrut²

¹ Institut des Sciences Cognitives, UMR CNRS 5015, 67 boulevard Pinel, F-69675 Bron cedex, France
[email protected]
² ERIC, Université Lumière Lyon 2, 4 av. P. Mendès-France, F-69676 Bron Cedex, France
[email protected]

Abstract. In this study, we consider low-level image classification with several machine learning algorithms adapted to high-dimension problems: kernel-based algorithms. The first is Support Vector Machines (SVM), the second is Bayes Point Machines (BPM). We compare these algorithms, based on strong mathematical results and nice geometrical arguments in a feature space, to the simplest algorithm we could imagine working on the same representation. We use different low-level data, experimenting with low-level preprocessing, including spatial information. Our results suggest that the kernel representation is more important than the algorithms used (at least for this task). This is a positive result, because much simpler and faster algorithms than SVM exist. Our additional low-level preprocessings only improved the success rate by a few percent.
1 Introduction
In multiclass categorization of images, two families of approaches coexist; the first is based on high-level extracted features and the second on low-level algorithms. In this work, we study the second approach, made practical, as shown by Chapelle et al. [4], by the use of machine learning algorithms capable of working on high-dimension datasets: kernel algorithms. Jalam et al. [10] reported that a naive approach on the same kernel representation could be as efficient as Support Vector Machines in the case of low-level text categorization. In this paper, we propose similar experiments in the case of images. Our goal is to test whether fast, well-known algorithms could be as efficient as other algorithms provided they use the same kernel representation. We also propose other low-level features. In the first part, we describe classical kernel-based algorithms. Section 3 describes naive kernel-based algorithms. We finally present experimental results and discuss our findings.
2 Classical Kernel-Based Algorithms
In the following, the training set is D = {(x_1, y_1), ..., (x_m, y_m)}, with the x_i's elements of the input space X, and the y_i's elements of the finite output space
G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 369–375, 2001. © Springer-Verlag Berlin Heidelberg 2001
Olivier Teytaud and David Sarrut
Y. The x_i's are images, encoded through features (see section 4); the y_i's can belong to {−1, 1} for two-class discrimination, or to {1, 2, ..., q} for q-class discrimination. The goal is to find an application from X to Y minimizing the misclassification probability of new samples. We denote by K : X × X → R a kernel on X.

2.1 Support Vector Machines and Multiclass Support Vector Machines
Support Vector Machines (SVM) are defined in [15] and in many tutorials; they are the most usual algorithm for kernel learning. They are based on margin maximization, and choose linear separations in a feature space, using the kernel trick. For brevity, this algorithm and its justifications won't be detailed here. The extension of SVM to the multiclass case was done by one-against-all in Vapnik's book [15]; hence x → f_k(x) = sign(Σ_i y_i α_i K(x_i, x) + b) was computed for each class k against all the others, and x was classified in the class k such that f_k(x) is maximum. Guermeur et al. [9] propose another version of multiclass SVM, mathematically justified but more computationally expensive, sometimes leading to better results.

2.2 Bayes Point Machines
Herbrich et al. [6] consider that SVM work mainly as approximators of a Bayesian algorithm defined by Rujan [11]. Their experiments suggest that, in the separable case, this algorithm, called the Bayes Point Machine, is more efficient than SVM. The principle of the algorithm is the following: 1) Let K be a kernel; x → K(x, x) must be constant equal to 1. 2) Consider the set H of kernel classifiers H = {x → Σ_{i=1}^m α_i K(x_i, x) / α_i ∈ R}. 3) The version space V is defined as the space of consistent hypotheses with norm 1: V = {w ∈ H / ||w|| = 1 and ∀i, y_i w(x_i) ≥ 0}. 4) The goal is to select as classifier the average of V, for the metric in H: ||w||² = Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j). This is done by playing billiards in the version space: 1) Find an initial point x_0 ∈ V. 2) Choose randomly a direction d_0 in H; let i be equal to 0. 3) Compute d_{i+1}, the new direction after bouncing on the boundary. 4) Replace i by i + 1 and return to 3). Choose as centre of mass the average point of this billiard (taking into account the metric; see Rujan [11] for details). As the billiard is not necessarily ergodic, a random step is added in the reflections so that the whole version space is uniformly covered (this "artificial" ergodicity hasn't been proved). Assuming that all vectors have norm 1 (for example K(x, y) = exp(−||x − y||²/σ²)), one can see that hard-margin SVM (i.e., minimization of ||w||² under the constraint ξ_i = 0) chooses the w in the version space which is the centre of the maximum inscribed sphere. Herbrich et al. [6] explain that such a point is an approximation of the average of the version space, and that when the version space has an irregular shape the approximation is not efficient. In these cases, their algorithm gives better results than SVM. The extension to the multiclass case can be done in a one-against-all approach.
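As a crude stand-in for the billiard, the Bayes point can also be approximated by averaging unit-norm linear classifiers drawn from the version space, here by naive rejection sampling in input space; the toy data and the sampling scheme are ours, not the authors' algorithm:

```python
import math
import random

random.seed(0)
# Linearly separable toy data: ((x1, x2), label).
data = [((1.0, 0.2), 1), ((0.8, 0.6), 1), ((-1.0, -0.1), -1), ((-0.7, -0.5), -1)]

def random_unit(dim=2):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def consistent(w):
    """Strictly consistent hypothesis: y * <w, x> > 0 for all examples."""
    return all(y * (w[0] * x[0] + w[1] * x[1]) > 0 for x, y in data)

# Keep only the sampled directions lying in the version space, then
# take their centre of mass as the approximate Bayes point.
samples = [w for w in (random_unit() for _ in range(20000)) if consistent(w)]
bp = [sum(c) / len(samples) for c in zip(*samples)]
print(consistent(bp), len(samples) > 0)
```

Rejection sampling scales terribly with dimension, which is precisely why Herbrich et al. resort to the billiard walk; the sketch only illustrates what quantity the billiard estimates.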
We here propose a modified Bayes Point Machine, in which we do not average w·x but sign(w·x). This technique has the drawback of not choosing a classifier among a linear family of classifiers, but it is theoretically closer to Bayesian inference and was experimentally slightly better on this particular benchmark.
3 Naive Kernel Based Algorithms
Description. The Naive Kernel-Based Algorithm (NKBA) used in our experiments is summarized below:
• Let (x_i)_{i∈I} be the family of classified examples (used for training). Let (x'_i) be the family of points to be classified.
• Let O be the matrix such that O_{i,j} = 1 if x_i belongs to class j, −1 otherwise.
• Let K_{i,j} = K(x_i, x_j) and let K'_{i,j} = K(x'_i, x_j).
• Let W be such that K1 × W = O, where K1 is the matrix K plus one column filled with 1's (K'1 below is defined similarly). W is chosen by Singular Value Decomposition with minimal square norm¹.
• Let O' = K'1 × W. We classify x'_i in the class k̂ = arg max_k O'(i, k).
With K a Radial Basis Function, this is a classical RBF algorithm. Here we decide to use other functions as well. Notice that the results of Schölkopf et al. [13], emphasizing the superiority of SVM over classical RBF approaches, do not apply here, as we do not use backpropagation for any layer of weights. One can question the mathematical justification of our algorithm, as we cannot justify it by the notion of geometrical margin ||w||² (as a regularization term) and VC-dimension bounds based upon it. However, Bartlett [1] recalls that the γ-empirical error (the number of points in D for which y_i f(x_i) ≤ γ) in classification is bounded by the mean squared error divided by (1 − γ)². For fixed γ, the γ-empirical error is thus linearly bounded by the mean square error. A bound depending upon the size of the weights is easily derived by combining results of [1]: the sets of functions H = {x → exp(−d²(x, y)/σ²)} for any d considered here are L-Lipschitz; so their γ-fat-shattering dimension is bounded by the γ/(2L) covering number of X, and so is O(1/γ^dim), with dim the dimension of X. If the sum of the weights is bounded by A, then with d the γ/(32A)-fat-shattering dimension of H, the γ-fat-shattering dimension of F = {Σ_{i=1}^M w_i f_i / M ∈ N, w_i ∈ R, f_i ∈ H, Σ_i |w_i| ≤ A} is bounded by (cA²d/γ²) log²(Ad/γ), with c a universal constant. So a trade-off between empirical error and the norm of the weights is justified. One can notice that in our experiments (part 4), allowing high values of M (the number of functions) increases the empirical success rate, but never leads to overfitting. This suggests that the fat-shattering dimension is better adapted to building bounds for this algorithm than arguments depending upon the number of weights. This was also the conclusion of [1] in another framework, with rigorous mathematical arguments.
¹ Gaussian elimination is the fastest, but Singular Value Decomposition and Householder are much better for numerical stability reasons; see [5].
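A minimal sketch of the NKBA above, under our own naming (NumPy's `lstsq` plays the role of the minimal-norm SVD solution the footnote mentions; the RBF kernel is only one possible choice of K):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # K[i, j] = exp(-||X[i] - Z[j]||^2 / sigma^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)

def nkba_fit(K, O):
    # K: (m, p) kernel matrix on training points; O: (m, Q) matrix with
    # O[i, j] = +1 if example i belongs to class j, -1 otherwise.
    K1 = np.hstack([K, np.ones((K.shape[0], 1))])   # K plus a column of 1's
    # SVD-based least squares: minimal-square-norm solution of K1 @ W = O.
    W, *_ = np.linalg.lstsq(K1, O, rcond=None)
    return W

def nkba_predict(K_prime, W):
    # K_prime: (t, p) kernel values between test and training points.
    K1p = np.hstack([K_prime, np.ones((K_prime.shape[0], 1))])
    O_prime = K1p @ W                 # decision values O' = K'1 x W
    return np.argmax(O_prime, axis=1) # class k maximizing O'(i, k)
```

With K a Radial Basis Function this reduces to the classical RBF network solved in closed form, as noted in the text.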
372
Olivier Teytaud and David Sarrut
Improvements. A common possible improvement (in both computational time and memory space) consists in defining K_{i,j} = K(x_i, x_j) and K′_{i,j} = K(x′_i, x_j) only for a subset S ⊂ I of possible values for j. This subset can be chosen by K-means (see [2]) or randomly. We decided to use a random set in order to preserve the speed and simplicity of our approach. The size of the subset is chosen as the minimal size ensuring an optimal empirical success rate. Another improvement, inspired by [11] and [6], has been tried: averaging O′ over multiple runs (Bayesian inference). This is possible by introducing random noise on O, preserving the sign of its elements (multiplying each element by a random variable with values between 0 and 1). We notice that this technique combines two learning paradigms: Bayesian inference (averaging among different possible models) and the addition of noisy examples to the training set (a solution to overfitting / resistance to noise). We can also average over another random step: the choice of the subset S. This Bayesian inference can be computed quickly by computing K1^{-1} once. Computing a new O′ is then done simply by:
• Computing K1^{-1} (the pseudo-inverse of K1; computational cost O(m³), with m the number of examples, by Singular Value Decomposition).
• Generating O (depends on the random generator; not very time-consuming).
• Computing W by a matrix product (O(mpQ), with m the number of examples, p the number of selected examples, i.e. the elements x_j for which K(x_i, x_j) is computed, and Q the number of classes).
• Verifying K1 × W = O (because of the heuristic consisting in choosing the smallest cardinal of S which ensures a null empirical error rate).
• Computing O′ by a matrix product (O(tpQ), with t the number of examples to be tested).
With this improvement, our Bayesian inference becomes as fast as Bayes Point Machines.
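The averaging scheme above can be sketched as follows (an illustrative sketch under our own naming, with the full example set as S; the pseudo-inverse is computed once and reused across runs, which is what makes the averaging cheap):

```python
import numpy as np

def nkba_bayes_predict(K1, K1_test, O, n_runs=50, seed=0):
    # K1: (m, p+1) training kernel matrix with the appended column of 1's;
    # K1_test: (t, p+1) likewise for test points; O: (m, Q) +/-1 targets.
    rng = np.random.default_rng(seed)
    K1_pinv = np.linalg.pinv(K1)          # O(m^3) SVD step, done once
    O_sum = np.zeros((K1_test.shape[0], O.shape[1]))
    for _ in range(n_runs):
        # sign-preserving noise: each entry multiplied by a value in [0, 1)
        noise = rng.uniform(0.0, 1.0, size=O.shape)
        W = K1_pinv @ (O * noise)         # O(m p Q) matrix product
        O_sum += K1_test @ W              # O(t p Q) matrix product
    return np.argmax(O_sum, axis=1)       # average the decisions, then arg max
```

Averaging the decision values O′ rather than the individual class labels keeps the per-run cost to two matrix products, as the cost breakdown in the text indicates.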
4 Experiments: Results in Image Classification
Material and Method. We used the image database called Corel14, described in [4,3], composed of 1400 images classified into 14 classes of 100 images (examples of classes are: "polar bears", "air shows", . . . ). An image is denoted by I : Π_{i=1}^{2} [1..n_i] → D, with n_i the number of samples along the i-th axis and D the domain of pixel values. Here the images are 24-bit color images and D is composed of three 8-bit channels R, G and B (Red, Green, Blue). These images are real-world images. There are numerous features which can be extracted to describe images. We decided to focus on low-level, high-dimension, translation- and rotation-invariant features: typically, histogram-based features. The features we selected are presented in Table 1 (with x = (x, y) a pixel coordinate). In Table 1, #E denotes the size of the set E. The first two histograms H_RGB and H_HSV are simple color histograms based on different color spaces. HSV (Hue Saturation Value) is considered user-oriented (by contrast with RGB, considered hardware-oriented), being based on the intuitive notions of tint, shade
Kernel Based Image Classiﬁcation
373
Table 1. Set of considered low-level image features

H_RGB  : D → R          H(i) = #{x / I(x) = i}
H_HSV  : D → R          H(i) = #{x / I(x) = i}
H_∇    : [0 : 2π] → R   H(θ) = #{x / angle(∇_x I, ∇_y I) = θ}
H_∆    : D → R          H(i) = #{x / ∆(x) = i}
H_•,d  : D × D → R      H(i, j) = #{(x, x′) / I(x) = i, I(x′) = j and dist(x, x′) ≤ d}
and tone. Such a color space is generally considered a better image feature for machine learning than RGB (see [8] for the RGB to HSV conversion). ∇_x I = ∂I/∂x and ∇_y I = ∂I/∂y denote the two orthogonal gradient values at point I(x), and ∆(x) = ∂²I/∂x² + ∂²I/∂y² the Laplacian. The gradient orientation histogram H_∇ is forced to be translation and rotation invariant by ordering the angles θ with θ_max = arg max_θ H(θ) as the origin. H_∆ is the Laplacian histogram. H_•,d(i, j) is the number of occurrences of colors i and j for points closer than a distance d. Except for the first two, each histogram is computed after a prior smoothing of the image by a Gaussian filter (of size 55 × 55 and σ = 9), in order to decrease the influence of potentially noisy pixels. No other image preprocessing has been applied. We think H_•,d brings interesting features into the machine learning algorithm, because it takes a notion of neighborhood into account. However, the value of d is difficult to choose a priori (we used d = 25). Moreover, in this first study, we did not consider multiresolution versions of the histograms. Note that, as the images are RGB images, there are different ways to consider the pixel value: for example, it would be possible to combine the three channels. However, we decided to take the three channels into account in order to benefit from the high-dimension capabilities of kernel-based machine learning. Hence, the gradient and the Laplacian are computed independently on each channel, and each histogram takes its values in D, which has 3 dimensions. In a future work, we will take several values of d and multiresolution features into account. The images were thus described by the set of presented features, and the three presented methods (SVM, BPM, NKBA) were applied. The learning set is composed of two thirds of the images and the validation set is composed of the remaining images. Results are averaged over multiple runs. The tested kernels are the linear kernel (K(x, y) = <x, y>) and the Jambu kernel (a symmetrized version of the χ² dissimilarity) K(x, y) = exp(−Σ_i (x_i − y_i)² / ((x_i + y_i) σ²)), with σ a heuristically chosen parameter
(such that the values of K(x_i, x_j) for (i, j) ∈ I² are nearly homogeneous).
Results. Chapelle et al. [4] suggest some possible preprocessings (such as raising frequencies to the power 1/8). Probably, classical improvements from text categorization would be useful as well (see for example [12], e.g. replacing the frequency α by α/(1 + α)). We do not study these preprocessings here. On the same database, with different approaches (based on feature extraction and decision
trees), Carson et al. [3] report nearly 50% success (cited in Chapelle et al. [4]). k-nearest neighbours are not better, at least with the similarities used in our kernel algorithms. Table 2 presents the success rates according to the different methods and features. Using SVM and H_RGB, Chapelle et al. [4] present slightly better results on the same benchmark with the same algorithm. We suppose a slight difference in preprocessing explains this difference. Different approaches (RGB, Gradient, Laplacian, H_•, all with the Jambu kernel) are combined through a linear combination (using a Singular Value Decomposition algorithm to choose the weights of the combination); this leads to the results of the line "Combined". Our Bayesian inference and modified Bayes Point Machine algorithms achieve respectively 82.0% and 82.8% in the case of the Jambu kernel on RGB. Linear algorithms did not achieve results as good as the Jambu kernel; their strength lies in recognition speed.

Table 2. Different success rates according to the different methods and features used. The first two lines use the linear kernel, the others the Jambu kernel.

Comparison SVM/NKBA               Further experiments with NKBA
Data          SVM     NKBA        Data               NKBA
RGB (linear)  57.6 %  60.0 %      Gradient (Jambu)   58.5 %
HSV (linear)  63.4 %  63.4 %      Laplacian (Jambu)  60.0 %
RGB (Jambu)   81.4 %  83.6 %      H_• (Jambu)        73.4 %
HSV (Jambu)   81.4 %  83.5 %      Combined           85.9 %
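As an illustration, the Jambu kernel of this section applied to normalized color histograms might look like the following sketch (our naming throughout; the guard for empty bins, where x_i + y_i = 0, is our addition and sets that χ² term to 0):

```python
import numpy as np

def color_histogram(img, bins=8):
    # img: (H, W, 3) uint8 image; joint 3-D color histogram over the
    # three channels, normalized to frequencies.
    h, _ = np.histogramdd(img.reshape(-1, 3).astype(float),
                          bins=(bins,) * 3, range=((0, 256),) * 3)
    return h.ravel() / h.sum()

def jambu_kernel(x, y, sigma=1.0):
    # symmetrized chi^2 dissimilarity: sum_i (x_i - y_i)^2 / (x_i + y_i),
    # turned into a kernel via exp(-chi2 / sigma^2).
    s = x + y
    chi2 = np.sum(np.where(s > 0, (x - y) ** 2 / np.where(s > 0, s, 1.0), 0.0))
    return np.exp(-chi2 / sigma**2)
```

The kernel is symmetric and equals 1 for identical histograms; σ would be tuned so that the kernel values over the training set are nearly homogeneous, as described above.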
5 Conclusion
We did not achieve a significant improvement by adding information about the proximity of colors, gradients, or Laplacians. The question of whether a 95% success rate can be achieved with such low-level tools remains open. Further work might include local direction evaluation, Spatial Gray Level Dependencies, and Gabor filters. The difference with the results given in Chapelle et al. [4] could be attributed to the difference in preprocessing, or could also be due to a weakness of our implementation of SVMs. SVMs are still a recent learning algorithm, and different implementations give different results, because different quadratic-programming algorithms have different behaviours (mainly, an almost "horizontal" objective function). As already observed in text categorization (Jalam et al. [10]), one can replace complex algorithms like Bayes Point Machines or Support Vector Machines by old, well-known, widely available algorithms. For SVMs, Mercer's condition does not seem to be what makes the difference between a good and a bad kernel. Both from a theoretical point of view (thanks to covering numbers coming from the fat-shattering dimension) and from a practical one, simple algorithms can be used in this field. In particular, computational time was much higher for SVM or BPM than for NKBA, even using a fast algorithm such as SMO (Smola, [14]) for Support Vector Machines. An interesting point for further work would be the use of the Lp-machine, a kernel-based algorithm using linear programming instead of quadratic programming, which appears theoretically
(because of weights-based bounds like the ones used above, in Section 3) and practically (for its numerical behavior) interesting. In text categorization, we did not get any improvement using ECOC (see Dietterich et al. [7]), a classical paradigm for extending two-class categorization to multiclass categorization. Moreover, the multiclass version of SVM proposed in [9] achieved better results than one-against-all multiclass SVM, whereas [16] does not mention any significant improvement of one multiclass SVM over the other. We suppose that the large number of classes and the high-dimensional nature of the problem explain the observed gap. In a further work, we shall verify whether these results carry over to histogram-based image classification.
References
1. Bartlett P.L. 1998, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Transactions on Information Theory, 44:525-536
2. Bishop C.M. 1995, Neural Networks for Pattern Recognition, Oxford
3. Carson C., S. Belongie, H. Greenspan, and J. Malik 1998, Color and texture-based image segmentation using EM and its application to image querying and classification, submitted to Pattern Anal. Machine Intell.
4. Chapelle O., P. Haffner, and V.N. Vapnik 1999, Support Vector Machines for Histogram-Based Image Classification, IEEE Transactions on Neural Networks, Vol. 10
5. Golub G., and C. Van Loan 1996, Matrix Computations, third edition, The Johns Hopkins University Press Ltd., London
6. Herbrich R., T. Graepel, and C. Campbell 1999, Bayes Point Machines: Estimating the Bayes Point in Kernel Space, in Proceedings of the IJCAI Workshop on Support Vector Machines, pages 23-27, 1999
7. Dietterich T.G., and G. Bakiri 1995, Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research 2:263-286, 1995
8. Foley J., A. van Dam, S. Feiner and J. Hughes 1990, Computer Graphics – Principles and Practice, Addison-Wesley, 2nd edition
9. Guermeur Y., A. Elisseeff, and H. Paugam-Moisy 2000, A new multi-class SVM based on a uniform convergence result. Proceedings of IJCNN'00, IV-183
10. Jalam R., and O. Teytaud 2000, Text Categorization based on N-grams, research report, Laboratoire ERIC
11. Ruján P. 1997, Playing Billiard in Version Space. Neural Computation 9, 99-122
12. Sahami M. 1999, Using Machine Learning to Improve Information Access, Ph.D. in Computer Science, Stanford University
13. Schölkopf B., K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V.N. Vapnik 1997, Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers. IEEE Transactions on Signal Processing, 45(11):2758-2765
14. Smola A.J. 1998, Learning with Kernels, Ph.D. in Computer Science, Technische Universität Berlin
15. Vapnik V.N. 1995, The Nature of Statistical Learning, Springer
16. Weston J., and C. Watkins 1998, Multi-class Support Vector Machines, Univ. London, U.K., Tech. Rep. CSD-TR-98-04
Gaussian Processes for Model Fusion
Mohammed A. El-Beltagy¹ and W. Andy Wright²
¹ Southampton University, Southampton, UK
[email protected]
² BAE SYSTEMS (ATC Sowerby), FPC 267, PLC, PO Box 5, Filton, Bristol, UK