Eight half-day tutorials will be offered at APSIPA 2015. These tutorials are run in parallel, and are free of charge for conference participants.

Morning Session
9:00am-12:20pm, December 16, 2015
  1. Brain Computer Interface - overview of R&D and future outlook
    Prof. Sung Chan Jun, Gwangju Institute of Science and Technology, South Korea

  2. Assisted Listening for headphones and hearing aids: Signal Processing Techniques
    Woon-Seng Gan and Jianjun He, Digital Signal Processing Lab, School of EEE, Nanyang Technological University, Singapore

  3. Introduction to Deep Learning and its applications in Computer Vision
    Wanli Ouyang and Xiaogang Wang, Department of Electronic Engineering, the Chinese University of Hong Kong, Shatin, Hong Kong

  4. Biomedical signal processing problems of human sleep monitoring
    Tomasz M. Rutkowski, University of Tsukuba, Japan


Afternoon Session
2:00pm-5:20pm, December 16, 2015
  1. Depth-based Video Processing Techniques for 3D Contents Generation
    Yo-Sung Ho, Gwangju Institute of Science and Technology (GIST), Korea

  2. Graph Signal Processing for Image Compression & Restoration
    Prof. Gene Cheung, National Institute of Informatics, Tokyo, Japan
    Prof. Xianming Liu, Department of Computer Science, Harbin Institute of Technology (HIT), Harbin, China

  3. Spoofing and Anti-Spoofing: A Shared View of Speaker Verification, Speech Synthesis and Voice Conversion
    Zhizheng Wu, University of Edinburgh, UK
    Tomi Kinnunen, University of Eastern Finland, Finland
    Nicholas Evans, EURECOM, France
    Junichi Yamagishi, University of Edinburgh, UK

  4. Wireless Human Monitoring
    Tomoaki Ohtsuki and Jihoon Hong, Department of Information and Computer Science, Keio University

Title: Brain Computer Interface - overview of R&D and future outlook
Prof. Sung Chan Jun, Gwangju Institute of Science and Technology, South Korea

In this tutorial, we will review BCI research and development trends from the past to the future. The basic components of BCI (calibration and feedback phases), experimental control paradigms (active, reactive, and passive paradigms), and the underlying methodological issues (related signal processing and machine learning techniques such as preprocessing, feature extraction, and classification) are presented in detail. Further, a variety of BCI applications are introduced and discussed from two main aspects (industrial and clinical). One of the primary hurdles in BCI development, BCI performance variation, is reviewed, and possible approaches to overcome it are suggested. In addition, a recent online survey on BCI and BCI games, conducted by the tutorial author's group with about 300 worldwide respondents (users, researchers, and developers), is presented. Lastly, the future outlook for BCI research and development is discussed.
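
The calibration-and-classification pipeline named above (preprocessing, feature extraction, classification) can be sketched in a few lines. This is a toy illustration with synthetic signals, an invented 12 Hz rhythm, and a deliberately simple nearest-mean classifier, not a description of any system covered in the tutorial:

```python
import numpy as np

rng = np.random.default_rng(3)
fs, n_trials, n_samp = 128, 40, 256

def make_trial(amp):
    # Synthetic single-channel EEG trial: a 12 Hz rhythm plus noise.
    t = np.arange(n_samp) / fs
    return amp * np.sin(2 * np.pi * 12 * t) + rng.standard_normal(n_samp)

def band_power_feature(x):
    # Log power in a 10-14 Hz band, a mu-rhythm-like feature.
    spec = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(n_samp, 1 / fs)
    return np.log(spec[(f >= 10) & (f <= 14)].mean())

# Calibration phase: collect labelled trials and extract features.
X = np.array([band_power_feature(make_trial(2.0)) for _ in range(n_trials)]
             + [band_power_feature(make_trial(0.2)) for _ in range(n_trials)])
y = np.array([0] * n_trials + [1] * n_trials)

# Classification: nearest class mean, the simplest possible classifier.
means = [X[y == 0].mean(), X[y == 1].mean()]
pred = np.array([int(abs(v - means[1]) < abs(v - means[0])) for v in X])
print("calibration accuracy:", (pred == y).mean())
```

A real BCI would replace each stage with the techniques the tutorial covers (spatial filtering, CSP-style features, trained classifiers), but the calibration-then-feedback structure is the same.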

Speaker Biography
Sung C. Jun majored in Mathematics and minored in Computer Science at the Korea Advanced Institute of Science and Technology (KAIST), Korea. He received M.S. and Ph.D. degrees in Applied Mathematics from the same institution in 1993 and 1998, respectively. In 1999-2000, he joined the Korea Research Institute of Standards and Science (KRISS) and coordinated software development for the Korean MEG system. In 2000-2002, he worked at the Department of Computer Science, University of New Mexico, USA as a post-doctoral researcher. He then joined the Biological & Quantum Physics Group, Los Alamos National Laboratory (LANL), USA as a post-doctoral associate in 2002, and became a Technical Staff Member there in 2004. After five years at LANL, he moved to the Gwangju Institute of Science and Technology (GIST), Korea as a faculty member in 2007. He is currently an associate professor at GIST, where he leads the Biocomputing Lab. His current research interests include biomedical imaging and biosignal processing with MEG/EEG, brain-computer interfaces, and computational modeling of electrical brain stimulation. He is a member of the IEEE, the Society for Neuroscience, the Asia-Pacific Signal and Information Processing Association (APSIPA), and the Organization for Human Brain Mapping (OHBM). He serves as an active reviewer for about 10 prestigious international journals, including NeuroImage, Human Brain Mapping, Physics in Medicine and Biology, Journal of Neural Engineering, Journal of Neuroscience Methods, and PLoS One.


Title: Assisted Listening for headphones and hearing aids: Signal Processing Techniques
Woon-Seng Gan and Jianjun He, Digital Signal Processing Lab, School of EEE, Nanyang Technological University, Singapore

With the strong growth of mobile devices and emerging virtual reality (VR) and augmented reality (AR) applications, headsets are becoming increasingly popular for personal listening due to their convenience and portability. Assistive listening (AL) devices such as hearing aids have also seen much advancement. Creating a natural and authentic listening experience is the common objective of these VR, AR, and AL applications. In this tutorial, we will present state-of-the-art audio and acoustic signal processing techniques to enhance sound reproduction in headsets and hearing aids.

This tutorial starts with an introduction to recent examples of audio applications in VR, AR, and AL. To ensure the tutorial is accessible to a novice audience, some background on spatial hearing fundamentals and the different classes of spatial audio reproduction techniques will be briefly introduced. This is followed by an outline of the three key parts of this tutorial, which focus on binaural techniques and, especially, their connections.

In part I, we will address recent advances in rendering natural sound over headphones. Based on a source-medium-receiver model, we analyze the differences between headphone sound reproduction and natural listening, which lead to five categories of signal processing approaches that can be employed to reduce the gap between the two: virtualization, sound scene decomposition, individualization, equalization, and head-tracking. Finally, the integration of these techniques is discussed and illustrated with an exemplar system (a.k.a. 3D audio headphones) developed at our lab.
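
As a minimal sketch of the virtualization step, a mono source can be rendered at a virtual direction by convolving it with a pair of head-related impulse responses (HRIRs). The signals and dummy HRIRs below are synthetic placeholders, not measured responses:

```python
import numpy as np

def binaural_virtualize(mono, hrir_left, hrir_right):
    # Convolve a mono source with left/right HRIRs to place it at the
    # virtual direction those HRIRs correspond to (invented here).
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

rng = np.random.default_rng(0)
mono = rng.standard_normal(1024)              # toy source signal
decay = np.exp(-np.arange(64) / 16.0)
hrir_l = rng.standard_normal(64) * decay      # dummy 64-tap HRIR
hrir_r = np.roll(hrir_l, 8)                   # crude interaural delay
left, right = binaural_virtualize(mono, hrir_l, hrir_r)
print(len(left), len(right))  # 1087 1087 (len(mono) + len(hrir) - 1)
```

A real renderer combines this with the other four categories: it selects or individualizes the HRIRs per listener, equalizes the headphone response, and re-selects HRIRs as head-tracking updates the source direction.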

In part II, we will discuss natural augmented reality audio. Natural listening in augmented reality requires the listener to be aware of the surrounding acoustic scene. In augmented reality, virtual sound sources are superimposed on the real world such that listeners can connect with the augmented sound sources seamlessly. Three typical headset systems for augmented reality audio will be presented, including a natural augmented reality (NAR) headset developed at our lab. The NAR headset employs adaptive filtering techniques to adapt to the listener's specific responses and environmental characteristics, and to compensate for the headphone response, achieving natural listening in real time.

In part III, other ways to augment human listening, i.e., reducing unwanted noise and enhancing speech perception, will be discussed. We will present active noise control (ANC) techniques for headsets and discuss how to integrate ANC with sound playback. Moreover, noise reduction and speech enhancement in hearing aids will be presented, with a focus on spatial information. Furthermore, ANC can also be incorporated into hearing aids to further reduce ambient noise.
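
The adaptive cancellation idea behind ANC can be illustrated with a plain LMS noise canceller. This is a schematic sketch only: the signals are synthetic, the two-tap noise path is invented, and there is no secondary-path modelling as real FxLMS-based ANC systems require:

```python
import numpy as np

def lms_cancel(reference, primary, n_taps=32, mu=0.01):
    # LMS adaptive noise cancellation: the filter learns to predict the
    # noise component of `primary` from the correlated `reference`;
    # the prediction error is the cleaned signal.
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(n_taps - 1, len(primary)):
        x = reference[n - n_taps + 1:n + 1][::-1]  # newest sample first
        e = primary[n] - w @ x                     # error = signal estimate
        w += mu * e * x                            # LMS weight update
        out[n] = e
    return out

rng = np.random.default_rng(1)
noise = rng.standard_normal(4000)                    # reference noise pickup
speech = np.sin(2 * np.pi * 0.01 * np.arange(4000))  # stand-in "signal"
noise_path = np.array([0.8, -0.3])                   # unknown noise path
primary = speech + np.convolve(noise, noise_path)[:4000]
cleaned = lms_cancel(noise, primary)
# After convergence the residual tracks the signal far better than the
# raw noisy observation does.
```

Practical ANC headsets use the filtered-x variant (FxLMS), because the electro-acoustic path from the anti-noise loudspeaker to the ear must be estimated and compensated before the update is stable.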

In the concluding part of the tutorial, we will provide some demonstrations (video and apps) to illustrate some of the advancements in assisted listening and natural sound rendering in headphones, and highlight new trends of signal processing approaches for natural and augmented listening in headsets.

This tutorial is an extension of the APSIPA 2014 Plenary Talk and also includes new work reported in recent publications in the IEEE Signal Processing Magazine, March 2015 special issue on Signal Processing Techniques for Assisted Listening.

Tutorial outline:
1. Introduction (30min)
   1.1. Emerging audio applications in virtual reality, augmented reality and assisted listening
   1.2. Fundamentals in human listening and spatial audio
   1.3. Overview of spatial audio techniques
   1.4. Outline of this tutorial
2. Part I: Natural sound rendering for virtual reality in headphones (40min)
   2.1. Challenges and overview of signal processing solutions
   2.2. Virtualization
   2.3. Sound scene decomposition
   2.4. Individualization
   2.5. Equalization
   2.6. Head-tracking
   2.7. An integration
   2.8. An example using 3D audio headphones
3. Part II: Natural Augmented reality audio in Headsets (45min)
   3.1. Natural listening in augmented reality: an overview
   3.2. Signal processing and practical challenges
   3.3. ARA headset for mobile and wearable devices
   3.4. Multichannel headphone sound reproduction
   3.5. Natural augmented reality (NAR) headset using adaptive techniques
4. Part III: Assisted listening in headsets and hearing aids (45min)
   4.1. Active noise control in headsets
   4.2. Integration of ANC and sound playback in headphones
   4.3. Noise reduction and speech enhancement in hearing aids
   4.4. Integration of ANC and noise reduction in hearing aids
5. Summary and future trends (20min)
   5.1. Summary
   5.2. Future trends
   5.3. Web links to learn more

Speaker Biographies

Woon-Seng Gan received his BEng (1st Class Hons) and PhD degrees, both in Electrical and Electronic Engineering, from the University of Strathclyde, UK in 1989 and 1993, respectively. He is currently an Associate Professor and Head of the Information Engineering Division, School of Electrical and Electronic Engineering, Nanyang Technological University. His research interests span the related areas of active noise control, adaptive signal processing, directional sound systems, spatial sound processing, and real-time embedded systems.

Professor Gan won the Institution of Engineers, Singapore (IES) Prestigious Engineering Achievement Award in 2001 for his work on the Audio Beam System. He has published more than 230 refereed international journal and conference papers, and has been granted four US patents. He co-authored a book, Digital Signal Processors: Architectures, Implementations, and Applications (Prentice Hall, 2005), which has since been translated into Chinese for adoption by universities in China. He was also the lead author of Embedded Signal Processing with the Micro Signal Architecture (Wiley-IEEE, 2007), and a book on Subband Adaptive Filtering: Theory and Implementation was published by John Wiley in August 2009. He has also co-authored a chapter in Rick Lyons's book Streamlining Digital Signal Processing: A Tricks of the Trade Guidebook, 2nd Edition (Wiley-IEEE Press, 2012).

Professor Gan is a Fellow of the Audio Engineering Society (AES), a Fellow of the Institution of Engineering and Technology (IET), a Senior Member of the IEEE, and a Professional Engineer of Singapore. Since 2012, he has been the Series Editor of SpringerBriefs in Signal Processing. He is also an Associate Technical Editor of the Journal of the Audio Engineering Society (JAES); an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing (ASLP); an Editorial Board member of the Asia-Pacific Signal and Information Processing Association (APSIPA) Transactions on Signal and Information Processing; and an Associate Editor of the EURASIP Journal on Audio, Speech and Music Processing. He is a member of the Design and Implementation of Signal Processing Systems (DiSPS) technical committee and the Industry DSP Technology (IDSP) standing committee of the IEEE Signal Processing Society. Professor Gan is a member of the Board of Governors of APSIPA (2013-2014) and an APSIPA Distinguished Lecturer (2014-15).

Jianjun He received his B.Eng. degree in automation from Nanjing University of Posts and Telecommunications, China in 2011 and is currently pursuing a Ph.D. degree in electrical and electronic engineering at Nanyang Technological University (NTU), Singapore. In 2011, he worked as a general assistant at the Nanjing International Center of Entrepreneurs (NICE), building platforms for start-ups by overseas Chinese scholars in Jiangning District, Nanjing, China. Since 2015, he has been a project officer with the School of Electrical and Electronic Engineering at NTU. His Ph.D. work has been published in the IEEE Signal Processing Magazine, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), IEEE Signal Processing Letters, and ICASSP, among others. He has been an active reviewer for IEEE TASLP, the Journal of the Audio Engineering Society, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, and IET Signal Processing. Aiming to improve human listening, his research interests include audio and acoustic signal processing, 3D audio (spatial audio), psychoacoustics, active noise control, source separation, and emerging audio and speech applications. He is a student member of the IEEE and the Signal Processing Society (SPS), a member of APSIPA, and an affiliate member of the IEEE SPS Audio and Acoustic Signal Processing Technical Committee.


Title: Introduction to Deep Learning and its applications in Computer Vision
Wanli Ouyang and Xiaogang Wang, Department of Electronic Engineering, the Chinese University of Hong Kong, Shatin, Hong Kong

Deep learning has become a major breakthrough in artificial intelligence and has achieved amazing success in solving grand challenges in many fields, including computer vision. Its success benefits from the big training data and massively parallel computational power that have emerged in recent years, as well as from advanced model designs and training strategies. In this talk, I will introduce deep learning and explain the magic behind it in layman's terms. Through concrete examples of computer vision applications, I will illustrate four key points about deep learning. (1) Unlike traditional pattern recognition systems, which rely heavily on manually designed features, deep learning automatically learns hierarchical feature representations from data and disentangles hidden factors of input data through multi-level nonlinear mappings. (2) Unlike existing pattern recognition systems, which design or train their key components sequentially, deep learning is able to jointly optimize all the components and create synergy through close interactions among them. (3) While most machine learning tools can be approximated with neural networks with shallow structures, for some tasks the expressive power of deep models increases exponentially as their architectures go deep. (4) Benefitting from the large learning capacity of deep models, we also recast some classical computer vision challenges as high-dimensional data transform problems and solve them from new perspectives. The introduced applications of deep learning in computer vision will focus on object detection, segmentation, and recognition. Some open questions related to deep learning will also be discussed at the end.
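
Point (2) above, the joint end-to-end optimization of feature extractor and classifier, can be seen in miniature in a two-layer network trained by backpropagation on XOR, a task no linear model on the raw inputs can solve. The architecture and hyperparameters below are arbitrary illustrative choices, not anything from the tutorial itself:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # XOR labels

W1 = rng.standard_normal((2, 16)); b1 = np.zeros(16)
W2 = rng.standard_normal((16, 1)); b2 = np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    h = np.tanh(X @ W1 + b1)          # hidden features, learned from data
    p = sigmoid(h @ W2 + b2)          # classifier on top of those features
    d2 = (p - y) / len(X)             # cross-entropy gradient at the logit
    d1 = (d2 @ W2.T) * (1 - h ** 2)   # backpropagate through tanh
    W2 -= lr * (h.T @ d2); b2 -= lr * d2.sum(0)
    W1 -= lr * (X.T @ d1); b1 -= lr * d1.sum(0)

preds = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int).ravel()
print(preds)  # the learned labels for the four XOR inputs
```

The hidden layer and the output layer are updated from the same loss, so the feature representation and the classifier are optimized jointly rather than designed in separate stages.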

Tutorial outline
1. Introduction to deep learning
   a. Historical review of deep learning
   b. Introduction to classical deep models
   c. Why does deep learning work?
2. Deep learning for object recognition
   a. Deep learning for object recognition on ImageNet
   b. Deep learning for face recognition
3. Deep learning for object segmentation
   a. Fully convolutional neural network
   b. Highly efficient forward and backward propagation for pixelwise classification
4. Deep learning for object detection
   a. Jointly optimize the detection pipeline
   b. Multi-stage deep learning (cascaded detectors)
   c. Mixture components
   d. Integrate segmentation and detection to suppress background clutter
   e. Contextual modeling
   f. Pre-training
   g. Model deformation of object parts, which are shared across classes
5. Open questions and future works

Speaker biographies

Wanli Ouyang received the PhD degree from the Department of Electronic Engineering, The Chinese University of Hong Kong, where he is now a Research Assistant Professor. His research interests include image processing, deep learning, computer vision, and pattern recognition. He is a member of the IEEE. He served as the program chair of the ACCV 2014 Workshop on Deep Learning on Visual Data, and serves on the program committees of prestigious conferences such as CVPR, ICCV, and ECCV.

Xiaogang Wang received his Bachelor degree in Electrical Engineering and Information Science from the Special Class for Gifted Young at the University of Science and Technology of China in 2001, his M.Phil. degree in Information Engineering from the Chinese University of Hong Kong in 2004, and his PhD degree in Computer Science from the Massachusetts Institute of Technology in 2009. He has been an assistant professor in the Department of Electronic Engineering at the Chinese University of Hong Kong since August 2009. He received the Outstanding Young Researcher in Automatic Human Behaviour Analysis Award in 2011, the Hong Kong RGC Early Career Award in 2012, and the Young Researcher Award of the Chinese University of Hong Kong. He is an associate editor of the Image and Vision Computing journal. He was an area chair of ICCV 2011, ECCV 2014 and ACCV 2014. His research interests include computer vision, deep learning, crowd video surveillance, object detection, and face recognition.


Title: Biomedical signal processing problems of human sleep monitoring
Tomasz M. Rutkowski, University of Tsukuba, Japan

Sleep studies have recently become a very hot topic in health and biomedical signal processing research. Sleep has a huge impact on the human brain in relation to learning, memory formation and creativity, yet our civilization is depriving people of sleep. There are also many medical conditions, such as sleep apnea, whose proper diagnosis and therapy are of vital importance to many people. Sleep monitoring is a fascinating challenge in biomedical signal processing, comprising brain (EEG) and peripheral electrophysiological signals (EOG, EMG, EKG, etc.), acoustic signals (breath and snoring sounds), body movements, temperature, skin conductance, and more. The multimodality of these signals, recorded at different scales, is what makes the problem fascinating and challenging. The tutorial will present state-of-the-art approaches to multimodal signal processing, feature extraction and machine learning applied to sleep studies. Recently, two Kickstarter-funded projects related to sleep monitoring raised around 400% (NeuroOn) and 1700% (Kokoon) of their fundraising targets. These numbers reflect the huge potential of sleep monitoring research and related startup projects.

The tutorial is addressed to a biomedical and general signal processing audience, since sleep monitoring studies cover a wide range of biomedical signal analysis problems. The topics will include an introduction to sleep monitoring engineering problems, together with biomedical and multimedia signal processing and machine learning topics including, but not limited to:

  • brain electrophysiological (EEG) and optical (fNIRS) signals capture and processing for sleep stages classification;
  • peripheral electrophysiological signals (EOG, EMG, EKG, etc.) processing and related body movement noise filtering, classification and separation;
  • sound (breath and snoring sounds) and video (sleeping user monitoring) processing and classification;
  • remaining physiological signals (body temperature, skin conductance, etc.), which will also be reviewed.

The multimodal signal processing methods above will cover classical linear methods (adaptive linear filtering, PCA, etc.) and transform-based methods (FFT, wavelets, Hilbert, etc.), as well as more recent data-driven methods such as univariate and multivariate EMD and the synchrosqueezed wavelet transform. Practical examples will be given to illustrate the presented multimodal biomedical signal processing methods, with sleep studies used as case examples.
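
As a small example of the transform-based methods listed above, the FFT band-power features classically used in sleep-stage scoring can be computed as follows. The epoch is synthetic and the amplitudes are invented for illustration:

```python
import numpy as np

def band_power(eeg, fs, f_lo, f_hi):
    # Average FFT power of `eeg` between f_lo and f_hi Hz.
    spectrum = np.abs(np.fft.rfft(eeg)) ** 2 / len(eeg)
    freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
    return spectrum[(freqs >= f_lo) & (freqs < f_hi)].mean()

fs = 100.0                      # Hz, a typical polysomnography rate
t = np.arange(0, 30, 1 / fs)    # one 30-second scoring epoch
# Synthetic "deep sleep" epoch: strong 2 Hz delta, weak 10 Hz alpha.
eeg = 3.0 * np.sin(2 * np.pi * 2 * t) + 0.3 * np.sin(2 * np.pi * 10 * t)
delta = band_power(eeg, fs, 0.5, 4.0)   # delta band power
alpha = band_power(eeg, fs, 8.0, 12.0)  # alpha band power
print(delta > alpha)  # True: delta dominates this synthetic epoch
```

A sleep stager would compute such band powers (delta, theta, alpha, sigma, beta) per epoch and per channel, and feed them, together with EOG/EMG features, to a classifier.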

Speaker biography:

Tomasz M. Rutkowski received his M.Sc. in Electronics and Ph.D. in Telecommunications and Acoustics from Wroclaw University of Technology, Poland, in 1994 and 2002, respectively. He received postdoctoral training at the Multimedia Laboratory, Kyoto University, and in 2005-2010 he worked as a research scientist at RIKEN Brain Science Institute, Japan. He currently serves as an assistant professor at the University of Tsukuba and as a visiting scientist at RIKEN Brain Science Institute. Professor Rutkowski's research interests include computational neuroscience, especially brain-computer interfacing technologies, computational modeling of brain processes, neurobiological signal and information processing, multimedia interfaces and interactive technology design. He received the Annual BCI Research Award 2014 for the best brain-computer interface project. He is a senior member of the IEEE, a member of the Society for Neuroscience, and a member of the Asia-Pacific Signal and Information Processing Association (APSIPA), where he serves as BioSiPS Technical Committee Chairman. He is a member of the Editorial Board of Frontiers in Fractal Physiology and serves as a reviewer for IEEE TNNLS, IEEE TSMC - Part B, Cognitive Neurodynamics, the Journal of Neural Engineering, PLOS One, and others.


Title: Depth-based Video Processing Techniques for 3D Contents Generation
Yo-Sung Ho, Gwangju Institute of Science and Technology (GIST), Korea

Abstract: With the emerging market of 3D imaging products, 3D video has become an active area of research and development in recent years. 3D video is the key to providing more realistic and immersive perceptual experiences than the existing 2D counterpart. There are many applications of 3D video, such as 3D movies and 3DTV, which are considered the main drivers of the next-generation technical revolution. Stereoscopic display is the current mainstream technology for 3DTV, while auto-stereoscopic multi-view display is a more promising solution that requires further research to resolve the associated technical difficulties.

In this tutorial lecture, we are going to cover the current state-of-the-art technologies for 3D contents generation. After defining the basic requirements for 3D realistic multimedia services, we will cover various multi-modal immersive media processing technologies. We also address the depth estimation problem for natural 3D scenes and discuss several challenging issues of 3D video processing, such as camera calibration, image rectification, illumination compensation and color correction. In addition, we are going to discuss the MPEG activities for 3D video coding, including depth map estimation, prediction structure for multi-view video coding, multi-view video-plus-depth coding, and intermediate view synthesis for multi-view video display applications.
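
As a concrete piece of the depth estimation step, once stereo matching has produced a disparity for a pixel in a rectified pair, depth follows from triangulation as Z = f * B / d. The camera parameters below are hypothetical:

```python
# Z = f * B / d for a rectified stereo pair, where f is the focal
# length in pixels, B the camera baseline in metres, and d the
# disparity in pixels. Purely illustrative values.
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

f_px, B_m = 1000.0, 0.1   # hypothetical focal length and baseline
print(disparity_to_depth(50.0, f_px, B_m))  # 2.0 metres
```

The same relation, applied per pixel to a disparity map, yields the depth maps used for multi-view video-plus-depth coding and intermediate view synthesis.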

Speaker biography:

Dr. Yo-Sung Ho received the B.S. and M.S. degrees in electronic engineering from Seoul National University, Seoul, Korea, in 1981 and 1983, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California, Santa Barbara, in 1990. He joined ETRI (Electronics and Telecommunications Research Institute), Daejeon, Korea, in 1983. From 1990 to 1993, he was with North America Philips Laboratories, Briarcliff Manor, New York, where he was involved in the development of the Advanced Digital High-Definition Television (AD-HDTV) system. In 1993, he rejoined the technical staff of ETRI and was involved in the development of the Korean DBS Digital Television and High-Definition Television systems. Since 1995, he has been with the Gwangju Institute of Science and Technology (GIST), where he is currently a Professor in the Information and Communications Department. Since August 2003, he has been Director of the Realistic Broadcasting Research Center at GIST. He serves as an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT). His research interests include Digital Image and Video Coding, Image Analysis and Image Restoration, Three-dimensional Image Modeling and Representation, Advanced Source Coding Techniques, Three-dimensional Television (3DTV) and Realistic Broadcasting Technologies.


Title: Graph Signal Processing for Image Compression & Restoration
Prof. Gene Cheung, National Institute of Informatics, Tokyo, Japan
Prof. Xianming Liu, Department of Computer Science, Harbin Institute of Technology (HIT), Harbin, China

Abstract: Graph signal processing (GSP) is the study of discrete signals that live on structured data kernels described by graphs. By allowing a more flexible graphical description of the underlying data kernel, GSP can be viewed as a generalization of traditional signal processing techniques that target signals on regular kernels (e.g., an audio clip sampled periodically in time), while still providing a frequency-domain interpretation of the observed signals. Though an image is a regularly sampled signal on a 2D grid, one can nonetheless treat an image patch as a graph-signal on a sparsely connected graph defined in a signal-dependent manner. Recent GSP works have shown that such an approach can lead to a compact signal representation in the graph Fourier domain, resulting in noticeable gains in image compression and restoration. Specifically, in this tutorial we will overview recent advances in GSP as applied to image processing. We will first describe how a Graph Fourier Transform (GFT), a generalization of known transforms like the Discrete Fourier Transform (DFT) and Discrete Cosine Transform (DCT), can be defined in a signal-dependent manner and lead to compression gains over the traditional DCT for piecewise smooth images, outperforming H.264 intra by up to 6.8dB. We will then describe how suitable graph-signal smoothness priors can be constructed for a graph-based image denoising algorithm, outperforming the state-of-the-art BM3D by up to 2dB for piecewise smooth images. Similar graph-signal smoothness priors can also be used for other image restoration problems, such as bit-depth enhancement of low-bit-depth images for HDR displays and de-quantization of compressed JPEG images. Finally, we will discuss how the graph Laplacian can be used as a contrast-enhancement booster for images captured in poorly lit environments that are also corrupted by noise.
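
The GFT construction described above can be sketched on a toy graph: build the graph Laplacian, take its eigenvectors as the Fourier basis, and project the graph-signal onto them. A 4-node path graph with a smooth signal (chosen arbitrarily here) is enough to see the energy compaction:

```python
import numpy as np

# Graph Fourier Transform on a 4-node path graph: Laplacian
# eigenvectors act as Fourier basis functions, and a smooth
# graph-signal concentrates its energy in the low "frequencies"
# (small Laplacian eigenvalues).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # path-graph adjacency
L = np.diag(W.sum(1)) - W                  # combinatorial Laplacian
eigvals, U = np.linalg.eigh(L)             # eigenvalues sorted ascending

signal = np.array([1.0, 1.1, 0.9, 1.0])    # smooth across the path
coeffs = U.T @ signal                      # forward GFT
recon = U @ coeffs                         # inverse GFT

print(np.allclose(recon, signal))          # True: U is orthonormal
print(np.argmax(np.abs(coeffs)))           # 0: energy in the DC mode
```

For image coding the graph is instead built per patch, with weak edges across image discontinuities, so that a piecewise smooth patch again has only a few significant GFT coefficients to transmit.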

Tutorial Outline:
1. Introduction
   - Graph Signal Processing (GSP) and applications of GSP
2. Fundamentals of GSP
   - Graph spectral theory
   - Graph transforms and wavelets
3. Image Coding using GSP
   - Review of image compression
   - Piecewise smooth image coding using GSP
   - Natural image coding using GSP
4. Image Restoration using GSP
   - Review of inverse imaging problems
   - Image denoising using graph-signal priors
   - Image bit-depth enhancement
   - Image contrast enhancement

Speaker Biography
Gene Cheung received the B.S. degree in electrical engineering from Cornell University in 1995, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of California, Berkeley, in 1998 and 2000, respectively.

He was a senior researcher at Hewlett-Packard Laboratories Japan, Tokyo, from 2000 until 2009. He is now an associate professor at the National Institute of Informatics in Tokyo, Japan, and an adjunct associate professor at the Hong Kong University of Science & Technology (HKUST) (2015-present).

His research interests include image & video representation, immersive visual communication and graph signal processing. He has published over 150 international conference and journal papers. He served as associate editor for the IEEE Transactions on Multimedia (2007-2011) and the DSP Applications Column in the IEEE Signal Processing Magazine (2010-2014). He currently serves as associate editor for the APSIPA Transactions on Signal and Information Processing (2011-present) and the SPIE Journal of Electronic Imaging (2014-present), and as area editor for EURASIP Signal Processing: Image Communication (2011-present). He was lead guest editor of the special issue on "Interactive Media Processing for Immersive Communication" in the IEEE Journal of Selected Topics in Signal Processing, published in March 2015. He served as a member of the Multimedia Signal Processing Technical Committee (MMSP-TC) of the IEEE Signal Processing Society (2012-2014), and is a member of the Image, Video, and Multidimensional Signal Processing Technical Committee (IVMSP-TC) (2015-2017). He has also served as technical program co-chair of the International Packet Video Workshop (PV) 2010 and the IEEE International Workshop on Multimedia Signal Processing (MMSP) 2015, area chair of the IEEE International Conference on Image Processing (ICIP) 2010, 2012-2013 and 2015, track co-chair for the Multimedia Signal Processing track of the IEEE International Conference on Multimedia and Expo (ICME) 2011, symposium co-chair for the CSSMA Symposium at IEEE GLOBECOM 2012, and area chair for ICME 2013-2015. He was an invited plenary speaker at IEEE MMSP 2013 on the topic "3D visual communication: media representation, transport and rendering". He is a co-author of the best student paper award winner at the IEEE Workshop on Streaming and Media Communications 2011 (in conjunction with ICME 2011), of best paper finalists at ICME 2011, ICIP 2011 and ICME 2015, of the best paper runner-up at ICME 2012, and of the best student paper award winner at ICIP 2013.

Xianming Liu is an Associate Professor with the Department of Computer Science, Harbin Institute of Technology (HIT), Harbin, China. He also works as a project researcher at the National Institute of Informatics (NII), Tokyo, Japan. He received the B.S., M.S., and Ph.D. degrees in computer science from HIT in 2006, 2008 and 2012, respectively. In 2007, he joined the Joint Research and Development Lab (JDL), Chinese Academy of Sciences, Beijing, as a research assistant. From 2009 to 2012, he was with the National Engineering Lab for Video Technology, Peking University, Beijing, as a research assistant. In 2011, he spent half a year as a visiting student at the Department of Electrical and Computer Engineering, McMaster University, Canada, where he then worked as a post-doctoral fellow from December 2012 to December 2013. He has published over 30 international conference and journal papers, including in top IEEE journals such as TIP, TCSVT and TIFS, and top conferences such as CVPR, IJCAI and DCC.


Title: Spoofing and Anti-Spoofing: A Shared View of Speaker Verification, Speech Synthesis and Voice Conversion
Zhizheng Wu, University of Edinburgh, UK
Tomi Kinnunen, University of Eastern Finland, Finland
Nicholas Evans, EURECOM, France
Junichi Yamagishi, University of Edinburgh, UK

Abstract: Automatic speaker verification (ASV) offers a low-cost and flexible biometric solution to person authentication. While the reliability of ASV systems is now considered sufficient to support mass-market adoption, there are concerns that the technology is vulnerable to spoofing, also referred to as presentation attacks. Spoofing refers to an attack in which a fraudster attempts to manipulate a biometric system by masquerading as another, enrolled person. Meanwhile, speaker adaptation in speech synthesis and voice conversion techniques can mimic a target speaker's voice automatically, and hence present a genuine threat to ASV systems.

The research community has responded to speech synthesis and voice conversion spoofing attacks with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from being solved; ASV systems remain vulnerable to spoofing, and a deeper understanding of speaker verification, speech synthesis and voice conversion will be fundamental to the pursuit of spoofing-robust speaker verification.

While the level of interest is growing, the effort devoted to spoofing countermeasures for ASV lags behind that for other biometric modalities, even though the vulnerabilities of ASV to spoofing are now well acknowledged. A tutorial on spoofing and anti-spoofing from the combined perspective of speaker verification, speech synthesis and voice conversion is therefore much needed. The tutorial will attract not only members of the growing anti-spoofing research community, but also the broader community of general practitioners in speaker verification, speech synthesis and voice conversion.

The speakers have led the research community in anti-spoofing for ASV since 2013, and have jointly authored a growing number of conference papers, book chapters and the latest survey paper, published in Speech Communication in 2015. Between them they have organised two special sessions and one evaluation/challenge (http://www.spoofingchallenge.org/) on the same topic. The experience gained through these activities is the foundation of this tutorial proposal for APSIPA ASC 2015.

Tutorial Outline:
PART 1 (1h30):
1. Introduction [10 mins]
   a. Joint view of speaker verification, speaker adaptation and voice conversion (e.g. speaker identity)
2. State-of-the-art speaker verification techniques: [25 mins]
   a. standard approaches to ASV (GMM-UBM, JFA, iVector, PLDA);
   b. advanced modelling for text-independent and text-dependent ASV;
   c. evaluation metrics and protocols;
3. State-of-the-art speech synthesis techniques [25 mins]
   a. Statistical parametric speech synthesis (SPSS) (e.g., HMM, DNN based synthesis)
   b. Speaker adaptation for SPSS (e.g., adaptation in HMM and DNN framework)
   c. Unit selection speech synthesis
4. State-of-the-art voice conversion techniques [25 mins]
   a. Mapping-based voice conversion (e.g., JD-GMM, DNN)
   b. Frequency warping-based voice conversion (e.g., WFW, DWF, amplitude scaling)
   c. Unit selection and exemplar-based voice conversion (e.g., frame selection, NMF)
5. Q&A [5 mins]
PART 2 (1h30):
6. Vulnerability of speaker verification to speech synthesis and voice conversion [25 mins]
   a. Why speaker verification is vulnerable to speech synthesis and voice conversion;
   b. specific spoofing attack studies including speech synthesis and voice conversion;
   c. evaluation databases, protocols, and metrics for vulnerability assessment;
7. Countermeasures to speech synthesis and voice conversion spoofing [25 mins]
   a. specific countermeasures, e.g. spectral, phase and mixed approaches;
   b. countermeasure integration;
   c. standard and non-standard databases;
   d. methodology weaknesses, and the need for generalised countermeasures.
8. ASVspoof 2015 [25 mins]
   a. a detailed presentation of the strategy, database, protocols and metrics;
   b. results and lessons learned;
9. Directions for future challenges and research [5 mins]
10. Q&A [10 mins]
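As a rough illustration of the GMM-UBM scoring named in item 2a of the outline, the sketch below computes an average per-frame log-likelihood ratio between a speaker-specific Gaussian mixture model and a universal background model (UBM). All models and feature values here are hand-picked, one-dimensional toys for illustration only; real systems use high-dimensional cepstral features and MAP-adapted speaker models.

```python
import math

def gauss_logpdf(x, mean, var):
    # log N(x; mean, var) for a 1-D Gaussian
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_loglik(x, weights, means, variances):
    # log-sum-exp over the mixture components
    logs = [math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def llr_score(frames, spk, ubm):
    # average per-frame log-likelihood ratio: speaker model vs. UBM
    return sum(gmm_loglik(x, *spk) - gmm_loglik(x, *ubm)
               for x in frames) / len(frames)

# Toy models: speaker GMM centred near 2.0, broader UBM around 0.0
spk = ([0.5, 0.5], [1.8, 2.2], [0.25, 0.25])
ubm = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])

genuine = [1.9, 2.1, 2.0, 1.8]     # frames resembling the target speaker
impostor = [-1.2, 0.9, -0.8, 1.1]  # frames resembling the background

print(llr_score(genuine, spk, ubm) > llr_score(impostor, spk, ubm))  # True
```

A genuine trial yields a positive average log-likelihood ratio and an impostor trial a negative one; the verification decision is a threshold on this score. Spoofed speech that mimics the target speaker's distribution pushes the score toward acceptance, which is exactly the vulnerability Part 2 addresses.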

Speaker Biographies:
Zhizheng Wu (University of Edinburgh, UK, zhizheng.wu@ed.ac.uk) has been a research fellow in the Centre for Speech Technology Research (CSTR) at the University of Edinburgh since 2014. He received the Ph.D. degree from Nanyang Technological University (NTU), Singapore. From 2007 to 2009, he was with Microsoft Research Asia as an intern researcher. He received the best paper award at APSIPA ASC 2012. His research interests include speech synthesis, voice conversion, spoofing and anti-spoofing, and speaker verification.

Tomi Kinnunen (University of Eastern Finland, Finland, tomi.kinnunen@uef.fi) received the Ph.D. degree in computer science from the University of Eastern Finland (UEF, formerly Univ. of Joensuu) in 2005. From 2005 to 2007, he was an associate scientist at the Institute for Infocomm Research (I2R) in Singapore. Since 2007, he has been with UEF. In 2010-2012, his research was funded by the Academy of Finland in a post-doctoral project focusing on speaker recognition. He is the principal investigator of a 4-year Academy of Finland project focusing on speaker recognition, voice conversion and anti-spoofing techniques. Dr. Kinnunen's team is a regular participant in the NIST speaker recognition evaluations. He was the chair of Odyssey 2014: The Speaker and Language Recognition Workshop. He is a partner in the large, recently launched Horizon 2020-funded "OCTAVE" project (octave-project.eu), which trials the transfer of speaker verification technology to both logical and physical access control scenarios, involving the integration of practical spoofing countermeasures.

Nicholas Evans (EURECOM, France, evans@eurecom.fr) is an Assistant Professor at EURECOM where he heads research in Speech and Audio Processing. In addition to other interests in speaker diarization, speech signal processing and multimodal biometrics, and in the scope of the EU FP7 ICT TABULA RASA project, he has studied the threat of spoofing to automatic speaker verification systems and developed new spoofing countermeasures. He serves as Lead Guest Editor for the IEEE T-IFS special issue on Biometric Spoofing and Countermeasures and the IEEE SPM special issue on Biometric Security and Privacy and is an Associate Editor of the EURASIP Journal on Audio, Speech, and Music Processing. He was general co-chair for IWAENC 2014 and will be technical co-chair for EUSIPCO 2015. He also contributed to the organisation of the TABULA RASA Spoofing Challenge held in conjunction with ICB 2013.

Junichi Yamagishi (University of Edinburgh, UK, jyamagis@inf.ed.ac.uk) is a senior research fellow and holds an EPSRC Career Acceleration Fellowship in the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. He was awarded a Ph.D. by Tokyo Institute of Technology in 2006 for a thesis that pioneered speaker-adaptive speech synthesis, and was awarded the Tejima Prize for the best Ph.D. thesis of Tokyo Institute of Technology in 2007. Since 2006, he has been with CSTR and has authored or co-authored about 100 refereed papers in international journals and conferences. His recent important work includes spoofing attacks against speaker-verification systems using adaptive speech synthesis, and the development of countermeasures against them. He was a scientific committee member and area coordinator for Interspeech 2012.


Title: Wireless Human Monitoring
Tomoaki Ohtsuki and Jihoon Hong, Department of Information and Computer Science, Keio University

Abstract: There has been growing interest in human monitoring using wireless technologies for elderly care, fitness tracking, target localization, and so on. Recent trends in human monitoring among academic and industrial researchers mostly focus on developing wearable sensors, which the user must carry at all times. Beyond wearable sensors, non-intrusive and non-contact wireless monitoring has become an attractive alternative because it is more user-friendly and enables long-term monitoring. In this tutorial we offer a comprehensive review of state-of-the-art research on wireless human monitoring technologies that enable the monitoring of human activities and locations without wearable sensors. We first give an overview of activity recognition using wireless techniques, such as activity recognition based on features obtained from radio waves. We then introduce activity recognition techniques based not on radio waves but on other features, such as temperature distribution. We also introduce various localization techniques, particularly device-free localization techniques.

Tutorial Outline:
Part 1: Activity recognition
1 Introduction to activity recognition
   1.1 Motivation and application
   1.2 Application requirements
   1.3 Existing sensor technologies and their problems
   1.4 Active sensors (e.g., wearable sensors) vs. passive sensors
2 Related work on passive sensors
3 Low-resolution infrared array-based activity recognition
4 Doppler radar-based vital sign monitoring
5 Conclusions and open research issues
Part 2: Indoor localization
1 Introduction to indoor localization
   1.1 Motivation and Application
   1.2 Application requirements
   1.3 Existing sensor technologies and their problems
   1.4 Device-based active (DBA) localization: GPS, RFID, wireless sensor networks
   1.5 Device-free passive (DFP) localization: Received signal strength (RSS), channel state information (CSI), MIMO
2 Device-free passive localization using antenna array
   2.1 Background of antenna array
   2.2 Radio wave features from antenna array
   2.3 SVM-based fingerprinting localization
   2.4 Results and discussion
3 Conclusions and open research issues
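The fingerprinting localization named in item 2.3 maps radio features measured at fixed receivers to location labels. As a minimal sketch of the fingerprinting idea, the toy below matches an observed received-signal-strength (RSS) vector against a small database of stored fingerprints using a nearest-neighbour rule; the tutorial's approach uses SVM classification on antenna-array features, and all RSS values here are invented for illustration.

```python
# Hypothetical fingerprint database: location label -> mean RSS vector (dBm)
# observed over three anchor links; values are illustrative only.
fingerprints = {
    "doorway": [-45.0, -60.0, -72.0],
    "sofa":    [-58.0, -48.0, -65.0],
    "desk":    [-70.0, -66.0, -50.0],
}

def euclidean(a, b):
    # straight-line distance between two RSS vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def localize(rss):
    # return the label whose stored fingerprint is closest to the observation
    return min(fingerprints, key=lambda loc: euclidean(fingerprints[loc], rss))

print(localize([-57.0, -49.0, -66.0]))  # "sofa"
```

In the device-free setting, the person carries no transmitter: their presence perturbs the RSS (or CSI) on existing links, and it is these perturbed feature vectors, collected in an offline training phase, that form the fingerprint database.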

Speaker Biographies:
Tomoaki Ohtsuki received the B.E., M.E., and Ph.D. degrees from Keio University, Yokohama, Japan, in 1990, 1992, and 1994, respectively, all in electrical engineering. From 1998 to 1999, he was with the Department of Electrical Engineering and Computer Sciences, College of Engineering, University of California, Berkeley, CA, USA. In 2005, he joined Keio University, where he is currently a Professor with the Department of Information and Computer Science. He has published over 130 journal papers and 310 international conference papers, and has received 14 awards. He is engaged in research on wireless communications, signal processing, and information theory. He served as a Technical Editor of the IEEE Wireless Communications Magazine and as a Symposium Co-Chair of many conferences, including the IEEE Globecom and the IEEE ICC. He is currently serving as an Editor of the IEEE Communications Surveys & Tutorials, IEICE Communications Express, and Elsevier's Physical Communication.

Jihoon Hong received the B.Sc. degree in computer software from Sangmyung University, Seoul, Korea, in 2006 and the M.Sc. and Ph.D. degrees in engineering from Keio University, Yokohama, Japan, in 2011 and 2014, respectively. In 2014, he was a Research Fellow of the Japan Society for the Promotion of Science (JSPS) for Japanese Junior Scientists. Since 2014, he has been a JSPS Postdoctoral Fellow and a Visiting Researcher with Keio University. His research interests include array signal processing, machine learning, wireless networking, and device-free sensing technologies.