Among the topics of machine learning, I have a huge interest in neural networks due to their ability to learn good feature representations through deep learning. However, I believe that there are some very unexplored areas that are of major importance to the future of machine learning.
First, we have the fact that the order of samples used for training influences the resulting generalization, which is good when considering curriculum learning, but is an issue for online learning, where more data is made available constantly. Second, there are other fields of machine intelligence that developed the concept of sparse distributed representations, which current developments in unsupervised learning have shown that deep neural networks may be able to learn, allowing data-format agnostic algorithms to be used. Finally, modelling temporal or sequential data sets have been studied, but their relation with the previous two topics remain vastly unexplored, which limits their applications.
Hence, my current research on this field tries to answer three questions:
- How to make the best use of the available data during learning and how to robustly integrate new obtained data?
- How to transform data into a high-level, generic and robust format that allows data-format agnostic models to be used to perform the desired inference on any data type?
- How to perform deep learning on data streams?
Curriculum and online learning
Research has shown that the order in which examples are provided to learning models can affect their behavior, either for better or worse, restricting the weight changes for future examples. Within a data set, a common solution is to apply momentum to the traditional stochastic gradient learning, so that the noise between minibatches is smoothed.
However, the problem doesn’t occur only because of stochastic gradient descent algorithms. For instance, if the first examples learned contain less noise or are more similar to the general data, which makes them better representatives for the whole data set, then learning can be improved. On the other hand, if the neural network is trained first on a smaller data set, then training on new data can be harder and provide worse results than training with all data at the same time.
This kind of learning is concerned with finding good samples to provide the learning algorithm first, so that future performance can be improved. While it has been shown that providing less noisy data or more frequent ones first improves learning, the selection was hand-made and it isn’t clear how one would do such choice in a big data set.
For the frequency criteria in text documents, it’s straightforward to define the samples with the most common features (just count the number of words in each n-gram), but how can we do this in an image data set? For the noise level criteria, how can one define what is noise or signal in a sample before the learning occurs?
I think that clustering algorithms with domain-specific similarity or distance measures can be helpful in this setting. The samples that should be given to the learning algorithm first are the ones that are the most similar to the rest of the data set while also being different from each other. I believe that this can boost the performance both in the beginning of the training, since less samples need to be evaluated while fine-tuning isn’t being performed yet, and at the end, with better placement of the weights during the initial training, as shown with hand-designed curriculum learning.
As I said before, we know that the order in which the data is provided is important for the quality of generalization. I conjecture that the same mechanism that allows curriculum learning to provide better results also impairs the learning on increasing data sets: there is a premature resource allocation that, once the neural network’s weights are defined, they don’t change a lot. Actually, this observation has been made previously in the literature, but I think this is the main link between curriculum and online learning.
I’d argue that we need neural network algorithms to be able to work in an online fashion, both because we have a huge stream of data flowing in continuously and because training deep neural networks is considerably expensive. To my knowledge, most current deep learning techniques aren’t able to deal with incremental data sets, requiring full retraining of the model if a significant amount of new data is available (it is possible to incrementally train the models, but the weights might be in places that reduce the generalization capacity of the whole network when the new data is considered).
An exception to this rule might be generative stochastic networks (GSN), which are able to learn all layers of representation at the same time. How well it adapts to increasing data sets is a current research topic of mine, but I think they might provide much better results than layer-wise training. The problem with layer-wise training is that it may not be a good idea to just retrain each layer with the initial conditions provided by the current weights, since higher layers could have very different inputs from all the new training below them. With all layers learning at the same time, they might learn representations for new data simultaneously, although I’m not sure they are able to provide similar performance to training in the full data set.
Sparse distributed representation (SDR)
Sparse distributed representation is a data format that is robust to noise and provides better representation of the data. In my opinion, this representation can be viewed from two different sides.
The first one is comes from Numenta as a way of representing information in their hierarchical temporal memory (HTM), which is a model of the neocortex. In this setting, SDRs are used to map data from and to the HTM, allowing the learning and prediction engine to interact with the real world through encoders and decoders, which currently are hand-designed, that map some specific kind of data to/from SDRs.
The second one comes from research on unsupervised deep learning, where it has been shown that higher level of representations are able to unfold the data distribution manifold, making the components more independent, and provide better representation. If the highest level is a binary layer, such as a restricted Boltzmann machine (RBM), then the highest layer can be viewed as a binary representation of the provided data. Moreover, if we force sparsity through regularization, it can be viewed as a SDR.
Therefore, I believe that deep neural networks with unsupervised learning can be used to learn the encoder and decoder for a given kind of data, allowing the inference to be performed in a higher-level algorithm, which can be a data-format agnostic learning engine such as the HTM.
I believe that the topic of temporal or sequence learning is very important for advancing machine intelligence. Techniques such as RBMs with 3 communicating layers (2 data sequence and 1 hidden) and long-short term memories (LSTM) have been used extensively to model sequence of events on a data set.
These data-format agnostic frameworks are ideal for building inference machines, where the data is converted to SDRs through learned encoders, which allows good high-level representation. While these could be merged together in current data sets to deal with problems such as video, they could also take advantage of deep online learning for continuous learning from the data stream directly.