"Bayesian statistics and massive data streams"

Applied Mathematics Session at Japanese-American Frontiers Symposium, 2008


  • Chair, Dr. Jake M. Hofman,
    Yahoo! Research, Human Social Dynamics Group,
    Interests: Bayesian statistics, modularity in biological networks, data-driven modeling, complex systems, network theory/analysis, machine learning, statistical inference.
  • Japanese Speaker, Dr. Yoshiyuki Kabashima,
    Professor, Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology.
    Interests: Statistical physics and information sciences, error-correcting codes, cryptography, CDMA multi-user detection, data compression, learning in neural networks, spin glasses.
    Prof. Kabashima's talk: JAFoS_kaba.pdf
  • U.S. Speaker, Dr. Mark H. Hansen,
    Associate Professor of Statistics, UCLA.
    Interests: Statistical methods for embedded sensing, participatory urban sensing, text mining and information retrieval, data streams.

  • Organizers:

  • Dr. Katsuhiro Nishinari, Associate Professor, Department of Aeronautics and Astronautics, University of Tokyo.
    Interests: Cellular automata (traffic and granular flows, ants, pedestrians, molecular motors), soliton theory and its applications, networked systems.
  • Dr. Raissa D'Souza, Associate Professor, UC Davis, Mech and Aero Eng Dept.
    Interests: Phase transitions, self-organization, structure and function of networks, physics of computation, complex systems.

  • Session abstract:

    In the modern-day world we are continually exposed to massive quantities of streaming data -- from the Internet, to cell phones, to real-time stock quotes. Moreover, as scientists, we are now in an era where truly massive quantities of data are collected on the fly. Astrophysics and particle physics experiments routinely collect terabytes of data each day. Massive databases from gene-sequencing chips are accumulating and driving rapid advances in bioinformatics. Keeping the data-rich computer and Internet applications on which our modern-day society is based stable involves analyzing massive log files and databases. And in the past year alone we have seen explosive growth in the amount of information available on the World Wide Web due to user-generated content, much of which consists of scientific lectures and datasets. But is there any way to organize and make sense of all this data? In addition, what new techniques exist for presenting these massive data streams?

    The Bayesian approach to data analysis offers a new paradigm. Rather than following the traditional approach of hypothesizing a model and then seeing how well it fits the data, the Bayesian approach begins from the raw data and lets the best description emerge. This is slowly revolutionizing how we deal with data. For instance, the best language translation software is no longer based on insights from human experts analyzing syntax and vocabulary, but on statistical string-matching of text -- scanning in tomes such as "War and Peace" in multiple languages and using statistics of word co-occurrence for translation.

    In addition to data analysis, what techniques for presenting data can we develop that transcend the traditional visual approach of charts and graphs? Can we sonify data and hear certain patterns in a data set that are difficult to recognize with traditional visualization tools? And, given the massive, real-time quantities of data to process, can the analyses be presented in multiple modes simultaneously, for instance by augmenting visual channels with aural ones?

    Dealing with massive data streams is such a pressing modern problem, not only for scientists but also for individuals, that their sonification can capture our zeitgeist -- giving us a visceral response to data -- intertwining science and art. A project for visualizing and sonifying Internet data streams, by US speaker Mark Hansen (a statistician who studies signal processing and information theory), was transformed into a highly acclaimed art installation recently exhibited at museums across the US, including the Whitney Museum of American Art in New York City.

    This topic is extremely well suited to JAFoS. Analyzing massive data sets is a frontier area that is becoming important across all fields of science. In addition, Japan and the US are the world leaders in this area, with most of the world's experts in machine learning residing in these two countries. This session will highlight a frontier of science that unifies our two countries.