Statistics is the cornerstone of most of my research, where probabilistic models act both as solutions in themselves, as in causal data modelling, and as tools for understanding and solving problems in other areas, such as the definition of loss functions.
In this area, my main interests involve:
- Generative models, which provide the basis for my neural networks research and help me understand how to define more informative cost functions and how to create temporal and causal networks;
- Nonparametric models, which provide high-quality function and distribution approximations;
- Online models and how to learn from incremental data sets.
Generative models are statistical models that describe how the data could have been created. Although machine learning researchers sometimes overlook this, many clustering and regression models can be analyzed from the statistical point of view of generative models, which allows greater freedom.
For instance, the squared loss is frequently used, but viewing it statistically, as the negative log-likelihood of a normal distribution, reveals its inherent limitations. This loss function states two things, one for the learning model and one for the user: the learning model is told that the user does not care whether the predicted value was higher or lower than the observed value as long as the distance is the same, and the user is told that positive and negative errors are equally probable. The first statement becomes false if, for example, the user places more importance on negative errors than on positive ones, while the second becomes false if the model, due to its limited capacity, has predictions biased to one side (negative or positive) and cannot communicate this to the user. To solve this problem, I have introduced a method to create asymmetric distributions, which provides both sides of this communication between the model and the user without much overhead.
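To make the asymmetry concrete, the standard pinball (quantile) loss is one well-known asymmetric alternative to the squared loss; the sketch below is generic and is not the asymmetric-distribution method mentioned above:

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # Symmetric: over- and under-predictions of equal magnitude cost the same.
    return np.mean((y_true - y_pred) ** 2)

def pinball_loss(y_true, y_pred, tau=0.8):
    # Asymmetric: tau weights under-predictions (y_true > y_pred),
    # (1 - tau) weights over-predictions.
    err = y_true - y_pred
    return np.mean(np.maximum(tau * err, (tau - 1) * err))

y_true = np.array([1.0, 2.0, 3.0])
over = np.array([1.5, 2.5, 3.5])   # every prediction 0.5 too high
under = np.array([0.5, 1.5, 2.5])  # every prediction 0.5 too low

# The squared loss cannot distinguish the two situations...
print(squared_loss(y_true, over) == squared_loss(y_true, under))  # True
# ...while the pinball loss with tau=0.8 penalizes under-prediction more.
print(pinball_loss(y_true, under) > pinball_loss(y_true, over))   # True
```

Choosing `tau` is exactly the user telling the model which side of the error matters more.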
Other types of generative models that I am interested in, besides neural networks, are temporal and causal models: in particular, how to infer relationships between different streams of data, with a focus on modelling what could have caused a state transition in a hidden Markov model, and how to build temporal models from static ones.
Nonparametric models are statistical models that provide a sweet spot for learning: they do not overfit, because their predictions are Bayesian, and they do not underfit, because their effectively infinite number of parameters allows them to fit any amount of data. Their main limitation, however, is a considerably higher computational cost.
The two nonparametric models I currently work with are Gaussian processes, used for function approximation, and Dirichlet processes, used for distribution approximation. I use these models in my other research areas to provide high-quality estimators, and on their own to improve generative models.
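As a concrete illustration of why Gaussian processes are attractive for function approximation, here is a minimal GP regression sketch in NumPy; the RBF kernel, its hyperparameters, and the noise level are illustrative choices, not settings from my work:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs.
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    # Posterior mean and variance of a zero-mean GP regression model.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([0.5, 10.0])  # one point near the data, one far away
mean, var = gp_posterior(x_train, y_train, x_test)
# Near the data the posterior variance is small; far away it reverts to the
# prior variance, and the mean reverts to the prior mean of zero.
```

The Bayesian behaviour described above is visible directly: the model reports low uncertainty where it has data and high uncertainty where it does not, instead of extrapolating confidently.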
I also take inspiration from them on how to flexibly adjust the number of parameters based on the data provided, which I believe could benefit neural networks. Since convergence to poor local minima can be an issue in this field, especially with deep architectures, slowly adding capacity to the network may reduce the number of poor minima encountered during training and provide better results.
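The idea of slowly adding capacity can be sketched as follows; `grow_hidden_layer` is a hypothetical helper, shown only to illustrate that hidden units can be added without changing the function the network currently computes:

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_hidden_layer(W1, b1, W2, n_new):
    # Add n_new hidden units to a one-hidden-layer network.
    # Existing weights are kept, and the new units' outgoing weights start
    # at zero, so the network's output is unchanged at the moment of growth.
    n_in, n_out = W1.shape[0], W2.shape[1]
    W1_new = np.hstack([W1, rng.normal(scale=0.01, size=(n_in, n_new))])
    b1_new = np.concatenate([b1, np.zeros(n_new)])
    W2_new = np.vstack([W2, np.zeros((n_new, n_out))])
    return W1_new, b1_new, W2_new

def forward(x, W1, b1, W2):
    h = np.tanh(x @ W1 + b1)
    return h @ W2

x = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 5))
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 2))
W1g, b1g, W2g = grow_hidden_layer(W1, b1, W2, n_new=3)
# Output is identical immediately after growing (new units contribute zero),
# so training can resume from where it left off with extra capacity.
print(np.allclose(forward(x, W1, b1, W2), forward(x, W1g, b1g, W2g)))  # True
```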
Many statistical models can handle new data, either by retraining the whole model or by performing online learning directly. I am very interested in the online learning mechanism and in how it can be approximated in other models, so that they can handle high-throughput data streams, but the problem of growing data sets is also interesting.
Consider, for instance, the problem of estimating the parameters of a model using variational inference. It is well known that the choice of variational distributions influences the quality of the resulting estimate and that the estimated parameters represent only a local maximum.
This raises a few questions when data sets grow with new data collected from a stream. How does the data set affect the convergence rate? Is it better to train on the part of the data set available first and then incorporate new incoming data incrementally, or to retrain the whole model from scratch?
To me, these questions apply not only to variational inference but also to simple maximum likelihood estimation. Moreover, they may be related to the issues of curriculum and online learning in neural networks, which are also concerned with training on part of the data set, and neural networks can themselves be viewed as statistical models. However, since neural networks may be more complex (and more accurate) than some statistical models for the same problems, analyzing this question on the simpler statistical models may provide insights for the neural networks field.
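For one concrete case where the incremental and from-scratch routes provably agree, consider the maximum-likelihood mean of a Gaussian; the stream and its parameters below are purely illustrative:

```python
import numpy as np

def batch_mle_mean(data):
    # Maximum-likelihood mean of a Gaussian: retrain from scratch on all data.
    return np.mean(data)

def online_mle_mean(mean, n, x):
    # Incremental update of the mean after observing one new point x,
    # without revisiting any of the previous n points.
    return mean + (x - mean) / (n + 1)

rng = np.random.default_rng(0)
stream = rng.normal(loc=3.0, scale=1.0, size=1000)

mean, n = 0.0, 0
for x in stream:
    mean = online_mle_mean(mean, n, x)
    n += 1

# Both routes reach the same estimate; only the computation differs.
print(np.isclose(mean, batch_mle_mean(stream)))  # True
```

The interesting cases are exactly the ones where this equivalence breaks down, such as variational inference, where the order in which data arrives can steer the optimization toward different local maxima.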