James Ting-Ho Lo
Professor
Department of Mathematics and Statistics, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250-0001

My research interests are concentrated in two clusters:

**Recurrent deep learning machines and the synthetic approach to designing dynamical systems:**

The conventional analytic approach suffers from the following limitations:

- A physics-based model is required.
- Physics-based models frequently contain prejudices or restrictive assumptions.
- Nonlinearity often means intractability.
- Adaptiveness or robustness requirements often cause intractability or compound it.

Multilayer perceptrons (MLPs) and their variants are (for a good reason) perhaps the best known artificial neural networks. MLPs with feedback connections are called recurrent multilayer perceptrons (RMLPs). MLPs and RMLPs are efficient or parsimonious universal approximators of functions and dynamical systems, respectively. MLPs and RMLPs with a deep architecture are called deep learning machines and recurrent deep learning machines, respectively. They are well suited for designing signal predictors/estimators/filters, system identifiers/controllers, time series predictors/identifiers, and fault detectors/identifiers.

**Computational models of the brain and brain-like learning machines:**

There have been a large number of findings in neuroscience. Integrating these findings into a computational model of the brain is necessary for understanding the brain and developing brain-like learning machines. As elementary computation of the brain is performed by biological neural networks (BNNs), a first step in constructing a computational model of the brain is to construct that of BNNs. Based on a computational model of BNNs, computational models of visual, auditory, somatosensory, and somatomotor systems are to be developed. Ultimately, these systems together with models of other parts in the brain will be connected so as to deliver high-level cognitive functions such as decision making, prediction, creation and other human behavior.

**Deconvexification for data fitting**
The risk-averting error criterion used in the convexification method for avoiding nonglobal local minima in training neural networks and estimating nonlinear regression models causes computer register overflow when its risk sensitivity index is large. To eliminate this difficulty, a normalized risk-averting error criterion is used instead. It is proven that the number of nonglobal local minima decreases to zero as the risk sensitivity index goes to infinity. Starting with a very large risk sensitivity index and gradually decreasing it to zero was always effective in finding a global minimum in all the numerical examples worked out so far.
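The overflow problem and its remedy can be illustrated with a minimal Python sketch. It assumes that the risk-averting error is the mean of exp(λe²) over the fitting errors e, and that the normalized criterion is (1/λ) times the logarithm of that mean. These forms, the function names `rae` and `nrae`, and the sample errors are assumptions made for illustration. The normalized form is evaluated with the usual log-sum-exp shift, so no large exponential is ever formed.

```python
import math

def rae(errors, lam):
    """Assumed unnormalized criterion: mean of exp(lam * e^2).
    math.exp raises OverflowError once lam * e^2 exceeds roughly 709."""
    return sum(math.exp(lam * e * e) for e in errors) / len(errors)

def nrae(errors, lam):
    """Assumed normalized criterion: (1/lam) * ln(mean(exp(lam * e^2))).
    The largest exponent is subtracted first (log-sum-exp shift), so every
    exponential that is actually evaluated is at most 1."""
    scaled = [lam * e * e for e in errors]
    shift = max(scaled)
    mean_shifted = sum(math.exp(s - shift) for s in scaled) / len(errors)
    return (shift + math.log(mean_shifted)) / lam

# Hypothetical fitting errors, used only to exercise the two functions.
errors = [0.1, -0.5, 2.0, 3.0]
mse = sum(e * e for e in errors) / len(errors)
max_sq = max(e * e for e in errors)

try:
    rae(errors, 1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

small_index_value = nrae(errors, 1e-6)    # close to the mean squared error
large_index_value = nrae(errors, 1000.0)  # close to the largest squared error
```

Under these assumptions the normalized value stays finite for any index, approaches the mean squared error for a small index, and approaches the largest squared error for a large one.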

**Recurrent deep learning machines**
A research thrust in recent years is the development of deep learning machines. Advantages of learning machines with a deep architecture over those with a shallow architecture were analyzed by Y. Bengio and Y. LeCun. However, feedback structures are conspicuously missing in the existing deep learning machines. Since neural networks with a feedback structure, called recurrent neural networks, are necessary for universally approximating dynamical systems, and provide better detection/recognition performance even for static patterns, developing recurrent deep learning machines is an important step beyond the current research thrust. Because of the additional training difficulty caused by feedback structures, the newly developed deconvexification method is expected to be of great value in the development of recurrent deep learning machines.

**A cortex-like learning machine**
Starting with a functional model and a low-order model of biological neural networks, a cortex-like learning machine, called clustering interpreting probabilistic associative memory (CIPAM), has been derived. CIPAM has the following advantages:

- No handcrafted label of the training data is needed for training the clusterer. Training the SPUs in the interpreter requires handcrafted labels. However, only one label is required for a cluster obtained by the clusterer, making the requirement much easier to fulfill than otherwise.
- The clusterer in CIPAM is a clusterer for spatial or temporal patterns or causes, but it does not involve selecting a fixed number of prototypes, cycling through the training data, using prototypes as cluster labels, or minimizing a non-convex criterion.
- Both the unsupervised and supervised training mechanisms are of the Hebbian type, involving no differentiation, backpropagation, optimization, iteration, or cycling through the data. They learn virtually with "photographic memories", and are suited for online adaptive learning. Large quantities of large temporal and spatial data, such as photographs, radiographs, videos, speech/language, text/knowledge, etc., are learned easily (although not at low computational cost).
- The "decision boundaries" are not determined by exemplary patterns from each and every pattern and "confuser" class, but by those from pattern classes of interest to the user. In many applications such as target and face recognition, there are a great many pattern and "confuser" classes and usually no or not enough exemplary patterns for some "confuser classes".
- Only a small number of algorithmic steps are needed for retrieval of labels. Detection and recognition of multiple/hierarchical temporal/spatial causes are easily performed. Massive parallelization at the bit level by VLSI implementation is suitable.
- Probability and membership functions of labels are generated and can easily be obtained from the interpreter SPUs.
- CIPAM generalizes not by a single holistic similarity criterion for the entire input exogenous feature vector, which noise, erasure, distortion and occlusion can easily defeat, but by a large number of similarity criteria for feature subvectors input to a large number of UPUs (unsupervised processing units) in different layers. These criteria contribute individually and collectively to generalization for single and multiple causes. Example 1: smiling, putting on a hat, growing or shaving a beard, or wearing a wig can upset a single similarity criterion used for recognizing a face in a mug-shot photograph. However, a face can be recognized by each of a large number of feature subvectors of the face. If one of them is recognized to belong to a certain face, the face is recognized. Example 2: a typical kitchen contains a refrigerator, a counter top, sinks, faucets, stoves, fruit and vegetables on a table, etc. The kitchen is still a kitchen if a couple of items, say the stoves and the table with fruit and vegetables, are removed.
- Masking matrices in a PU eliminate effects of corrupted, distorted and occluded components of the feature subvector input to the PU, and thereby enable maximal generalization capability of the PU, and in turn that of CIPAM.
- CIPAM is a neural network, but it is no longer a black box with "fully connected" layers, which is much criticized by opponents of such neural networks as multilayer perceptrons (MLPs) and recurrent MLPs. In a PU of CIPAM, synaptic weights are covariances between an orthogonal expansion and the label of its input feature subvector. Each PU has a receptive region in the exogenous feature vector input to CIPAM and recognizes the cause(s) appearing within the receptive region. Such properties can be used to help select the architecture (i.e., layers, PUs, connections, feedback structures, etc.) of CIPAM for the application.
- CIPAM may have some capability of recognizing rotated, translated and scaled patterns. Moreover, easy learning and retrieving by a CIPAM allow it to learn translated, rotated and scaled versions of an input image with ease.
- The hierarchical architecture of the clusterer stores models of the hierarchical temporal and spatial worlds (e.g., letters, words and sentences).
- The synaptic weight matrices in different PUs can be added up to combine learned knowledge at virtually no additional cost. This property can be exploited to greatly increase CIPAM's capability of recognizing rotated, translated and scaled patterns.
- Ambiguity and uncertainty are represented and resolved with conditional probabilities and truth values in the sense of fuzzy logic.
- Noise and interference in inputs self-destruct like random walks, with residues eliminated gradually by forgetting factors in the synapses, leaving essential information that has been learned by repetition and emphasis.
- The architecture of a CIPAM can be adjusted without discarding learned knowledge in the CIPAM. This allows enlargement of the feature subvectors, increase of the number of layers, and even increase of feedback connections.
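The one-pass Hebbian learning and probability-like retrieval described in the list above can be sketched in Python. This is a toy simplification and not the actual CIPAM algorithm: it assumes bipolar (+1/-1) inputs and labels, takes the orthogonal expansion to be the vector of all subset products of the input components, and omits masking matrices, forgetting factors, and layering. The class `ToyProcessingUnit` and its method names are hypothetical.

```python
from itertools import combinations
from math import prod

def orthogonal_expansion(v):
    """All products over subsets of a bipolar vector v (length 2**len(v)).
    Expansions of two different bipolar vectors have inner product 0;
    the expansion of v with itself has inner product 2**len(v)."""
    n = len(v)
    return [prod(v[i] for i in subset)
            for size in range(n + 1)
            for subset in combinations(range(n), size)]

class ToyProcessingUnit:
    """Accumulates label-times-expansion and expansion sums in one pass."""

    def __init__(self, n_inputs):
        self.d = [0.0] * 2 ** n_inputs  # sum of label * expansion
        self.c = [0.0] * 2 ** n_inputs  # sum of expansion

    def learn(self, v, label):
        # Hebbian-style update: no gradient, no iteration over the data set.
        for i, x in enumerate(orthogonal_expansion(v)):
            self.d[i] += label * x
            self.c[i] += x

    def retrieve(self, v):
        # Returns the average label stored for v, or None if v was never seen.
        x = orthogonal_expansion(v)
        count = sum(ci * xi for ci, xi in zip(self.c, x))
        if count == 0:
            return None
        return sum(di * xi for di, xi in zip(self.d, x)) / count

pu = ToyProcessingUnit(3)
pu.learn([1, -1, 1], +1)
pu.learn([1, -1, 1], +1)
pu.learn([1, -1, 1], -1)   # conflicting label for the same pattern
pu.learn([-1, -1, 1], -1)

ambiguous = pu.retrieve([1, -1, 1])   # average of +1, +1, -1
certain = pu.retrieve([-1, -1, 1])    # only label -1 was stored
unseen = pu.retrieve([1, 1, 1])       # never learned
```

In this sketch a retrieved value r in [-1, 1] can be read as a relative frequency (1 + r)/2 of the label +1 for that input, which is one way to see how conflicting labels turn into graded outputs rather than errors.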

**A low-order model of biological neural networks**
Motivated by a functional model of biological neural networks (BNNs), a low-order model (LOM) of the same has been obtained. It is a recurrent hierarchical network of processing units (PUs) and feedback nerves of different lengths among the PUs. Each PU comprises models of axonal/dendritic trees (for encoding inputs to the processing unit); synapses (for storing covariances between axonal/dendritic codes and labels of said inputs); spiking and nonspiking somas (for retrieving/generating labels); unsupervised/supervised learning mechanisms; and a maximal generalization scheme.

To the best of my knowledge, LOM is the only biologically plausible model of BNNs that provides logically coherent answers to the following questions:

- What is the information carried by the spike rate in a spike train?
- How is knowledge encoded?
- In what form is the encoded knowledge stored in the synapses?
- What does the dendritic node do? How does the dendritic node do it?
- How are dendritic nodes organized into dendritic trees? Why is there compartmentalization in dendritic trees?
- How do the dendritic trees contribute to the neural computation?
- How is unsupervised learning performed by a multilayer neural network?
- How is a piece of knowledge stored in the synapses retrieved and converted into spike trains?
- How do neural networks generalize on corrupted, distorted or occluded patterns?

**A functional model of biological neural networks**
Derivation of a functional model of biological neural networks, called temporal hierarchical probabilistic associative memory (THPAM), is guided by the following four neurobiological postulates:

- The biological neural networks are recurrent multilayer networks of neurons.
- Most neurons output a spike train.
- Knowledge is stored in the synapses between neurons.
- Synaptic strengths are adjusted by a version of the Hebb rule of learning.

The construction of a functional model of biological neural networks based on all four postulates has broken the barriers confining the multilayer perceptrons and the associative memories. A first contribution of this paper lies in each of the following features of THPAM, which such existing models as the recurrent multilayer perceptron and associative memories do not have: 1. a recurrent multilayer network learning by a Hebbian-type rule; 2. fully automated unsupervised and supervised Hebbian learning mechanisms (involving no differentiation, error backpropagation, optimization, iteration, cycling repeatedly through all learning data, or waiting for asymptotic behavior to emerge); 3. dendritic trees encoding inputs to neurons; 4. neurons communicating with spike trains carrying subjective probability distributions or membership functions; 5. masking matrices facilitating recognition of corrupted, distorted, and occluded patterns; and 6. feedbacks with different delay durations for fully utilizing temporally and spatially associated information.

**Convexification for data fitting, and robust processing**
The neural network approach has been plagued by the local minima problem in training. To solve this problem, a new type of risk-averting error criterion was discovered. On one hand, the convexity region of the criterion expands as its risk-sensitivity index increases. On the other hand, when its sensitivity index goes to zero, the criterion converges to the standard mean squared error. These properties suggest a convexification and a deconvexification phase in minimizing the standard mean squared error for avoiding poor local minima. This two-phase method can be used together with any local or "global" optimization technique and is expected to effectively solve the local minima problem in not only training neural networks but also estimating nonlinear regression models and other kinds of nonlinear data fitting.

As its sensitivity index goes to infinity, the risk-averting error criterion approaches a minimax criterion. This suggests that the risk-averting error criterion provides a continuous spectrum of robustness. Depending on the application, the degree of robustness can be selected by setting an appropriate value of the sensitivity index.
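The two limiting behaviors described above can be written out explicitly. The notation below is assumed for illustration: $f(x_k, w)$ is the model output for input $x_k$ with weights $w$, $y_k$ is the target, $e_k = y_k - f(x_k, w)$, $K$ is the number of training pairs, and $\lambda > 0$ is the risk-sensitivity index. The criterion is shown together with a normalized form that has the same minimizers for any fixed $\lambda$:

$$
J_\lambda(w) = \sum_{k=1}^{K} \exp\!\left(\lambda \,\lVert e_k \rVert^2\right),
\qquad
C_\lambda(w) = \frac{1}{\lambda} \ln\!\left(\frac{1}{K}\, J_\lambda(w)\right)
$$

$$
\lim_{\lambda \to 0^+} C_\lambda(w) = \frac{1}{K} \sum_{k=1}^{K} \lVert e_k \rVert^2,
\qquad
\lim_{\lambda \to \infty} C_\lambda(w) = \max_{1 \le k \le K} \lVert e_k \rVert^2
$$

The first limit is the mean squared error and the second is the worst-case (minimax) error, so intermediate values of $\lambda$ interpolate between average-case and worst-case fitting.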

**Adaptive system identification**
Two new paradigms of adaptive processing have been developed for adaptive system identification: (1) adaptive feedforward and recurrent neural networks with long- and short-term memories and (2) accommodative neural networks (i.e., adaptive recurrent neural networks with fixed weights). The former adjust only their linear weights online, and the latter do not even need online adjustment for adaptation. These represent perhaps the only two effective systematic general approaches to adaptive processing for system identification.

**Overcoming the compactness limitation of neural networks**
A neural filter is obtained by fitting a recurrent neural network (RNN) such as a recurrent multilayer perceptron to signal and measurement data. If the ranges of the signal and measurement expand over time, such as in financial time series prediction, satellite orbit determination, aircraft/ship navigation, and target tracking, or are large relative to the filtering resolution or accuracy required, then the size of the RNN and the training data set must be large. The larger the RNN and the training data set are, the more difficult it is to train the RNN on the training data set. Furthermore, the time periods over which the training data are collected, by computer simulation or actual experiment, are necessarily of finite length. If the measurement and signal processes grow beyond these time periods, the neural network trained on the training data usually diverges.
To eliminate this difficulty, called the compactness limitation, we propose the use of dynamical range transformers. There are two types of dynamical range transformer, called dynamical range reducers and dynamical range extenders, which are preprocessors and postprocessors, respectively, of an RNN. A dynamical range reducer dynamically transforms a component of an exogenous input process (e.g., a measurement process) and sends the resulting process to the input terminals of an RNN so as to reduce the valid input range or approximation capability required of the RNN. A dynamical range extender, on the other hand, dynamically transforms the output of an output node of an RNN so as to reduce the valid output range or approximation capability required of the RNN. The purpose of both the dynamical range reducer and extender is to ease the RNN size and training data requirements and thereby lessen the training difficulty. The fundamental range transforming requirement in using a dynamical range transformer (i.e., a dynamical range reducer or extender) is the existence of a recursive filter comprising an RNN and the dynamical range transformer that approximates the optimal filter to any accuracy, provided that the RNN with the selected architecture is sufficiently large. A recursive filter comprising an RNN and dynamical range transformers is called a recursive neural filter.
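As a rough illustration of the idea, the Python sketch below uses differencing as a range reducer and accumulation as a range extender. These particular transformers are assumed here as simple examples; the function names are hypothetical, and the RNN that would sit between the two stages is replaced by a pass-through so that only the range behavior is shown.

```python
import math

def range_reducer(measurements):
    """Differencing preprocessor: emits y[t] - y[t-1] (with y[-1] taken as 0).
    The differences stay bounded even when y itself drifts without bound."""
    previous = 0.0
    reduced = []
    for y in measurements:
        reduced.append(y - previous)
        previous = y
    return reduced

def range_extender(increments, initial=0.0):
    """Accumulating postprocessor: sums bounded increments back into a
    signal whose range can grow over time."""
    total = initial
    extended = []
    for d in increments:
        total += d
        extended.append(total)
    return extended

# Hypothetical measurement process with a steadily growing range.
measurements = [0.5 * t + math.sin(0.3 * t) for t in range(200)]

reduced = range_reducer(measurements)   # what the RNN would receive
network_output = reduced                # pass-through stand-in for the RNN
restored = range_extender(network_output)

raw_span = max(measurements) - min(measurements)
reduced_span = max(reduced) - min(reduced)
```

Here the raw measurements span a range that keeps growing with the length of the record, while the differenced sequence stays within a fixed band, so a network trained on a finite record would see inputs of the same range later on.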

**Synthetic approach to optimal filtering**
The long-standing notorious problem of nonlinear filtering (e.g., prediction, estimation, smoothing) was solved in its most general setting in 1992 by a synthetic (or neural network) approach. R. E. Kalman said in his 1998 email to me: "I read your patents and paper. I am absolutely amazed." The synthetic approach has the following advantages:
1. No such assumption as the Markov property, linearity of signal or measurement process, Gaussian distribution, or additive measurement noise is necessary.
2. It applies, even if a mathematical model of the signal and measurement processes is not available.
3. The resultant neural filter has the minimum error variance for the given structure.
4. The neural filter converges to the minimum-variance filter as the number of hidden neurons increases.
5. Much like the Kalman filter, the neural filter requires no Monte Carlo simulation online and is well suited for real-time processing.
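Advantages 3 and 4 can be stated in symbols. The notation is assumed for illustration: $x_t^{(k)}$ and $y_t^{(k)}$ are the signal and measurement at time $t$ in the $k$-th of $K$ training realizations of length $T$, and $\hat{x}_t(\,\cdot\,; w)$ is the output of the recurrent network with weights $w$. Training minimizes the empirical mean squared error

$$
Q(w) = \frac{1}{KT} \sum_{k=1}^{K} \sum_{t=1}^{T}
\left\lVert x_t^{(k)} - \hat{x}_t\!\left(y_1^{(k)}, \ldots, y_t^{(k)}; w\right) \right\rVert^2 ,
$$

while the minimum-variance filter it is meant to approximate is the conditional expectation

$$
\hat{x}_t^{\,\ast} = \mathrm{E}\!\left[\, x_t \mid y_1, \ldots, y_t \,\right].
$$

Only sample realizations of the signal and measurement processes enter $Q(w)$, which is why no Markov, linearity, Gaussian, or additive-noise assumption is needed.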

1. **Developing recurrent deep learning machines.**

Deep learning machines (DLMs) are known to have more efficient architectures, require less training data, and provide more effective representations of functions or classifiers than shallow learning machines. Effective deep architectures, such as those of convolutional nets, and training methods, such as the intriguing greedy layer-wise training strategy for Boltzmann machines, have been developed in the recent research thrust on DLMs.

However, feedback structures are conspicuously missing in the research thrust. Feedbacks to a computing node bring current or past information contained in neighboring or larger receptive fields of other computing nodes to said computing node for forming better local representations or features. Such information is required in processing dynamical data (i.e., sequential data with recursive dynamics) and can enhance processing accuracy and generalization for both sequential data (with or without recursive dynamics) and static data.

Deep learning machines with a feedback structure are called recurrent deep learning machines (RDLMs). Extending DLMs such as convolutional nets and Boltzmann machines to RDLMs such as recurrent convolutional nets and recurrent Boltzmann machines is an important and challenging research effort now underway. Because of the large number of nonglobal local minima on the training error landscape, training DLMs, and especially RDLMs, is difficult. The deconvexification method, which overcomes the well-known local-minimum problem, is expected to play an important role in the development of RDLMs.

2. **Developing a systematic general approach to designing adaptive or robust dynamical systems for system control and filtering in uncertain or dynamically changing environments.**

Because of their practical importance, robustness and adaptiveness are two fundamental issues that have been studied extensively in control and filtering for more than 30 years. Two simple, effective, systematic general approaches to adaptive processing have been developed for system identification. One approach employs a UAODS with long- and short-term memories, the former being determined in the lab before deployment and the latter adjusted online by one of the fast adaptive linear filter algorithms (e.g., LMS and RLS) developed over more than 40 years. The other uses a UAODS (with fixed weights) without online weight adjustment, which is called an accommodative processor.
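The split between long- and short-term memories in the first approach can be sketched in Python. The sketch assumes a small fixed tanh hidden layer as a stand-in for the weights determined before deployment, and applies the standard LMS update only to the linear output weights. The hidden-layer values, the input sequence, the "unknown" target weights, and all names are hypothetical choices for illustration.

```python
import math

# Hypothetical frozen long-term memory: (gain, bias) of each tanh hidden unit.
FIXED_HIDDEN = [(1.0, 0.0), (0.5, -1.0), (2.0, 1.0)]

def hidden_features(x):
    """Fixed nonlinear features plus a constant bias term; never updated online."""
    return [math.tanh(a * x + b) for a, b in FIXED_HIDDEN] + [1.0]

def lms_step(v, h, target, mu):
    """One LMS update of the linear output weights v (the short-term memory)."""
    error = target - sum(vi * hi for vi, hi in zip(v, h))
    return [vi + mu * error * hi for vi, hi in zip(v, h)], error

def run(steps=20000, mu=0.05):
    # Output weights of the system being identified; unknown to the learner.
    true_v = [0.7, -1.2, 0.4, 0.3]
    v = [0.0] * len(true_v)
    abs_errors = []
    for t in range(steps):
        x = math.sin(0.37 * t)                        # input sequence
        h = hidden_features(x)
        target = sum(a * b for a, b in zip(true_v, h))
        v, error = lms_step(v, h, target, mu)
        abs_errors.append(abs(error))
    return v, abs_errors

v, abs_errors = run()
initial_error = abs_errors[0]
final_error = sum(abs_errors[-100:]) / 100.0
```

Because only the linear weights move, each online update is a few multiplications and additions, and the error criterion is quadratic in the adjusted weights, so the online stage has no local-minimum problem of its own.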

For robust processing, we use the novel normalized risk-averting error (NRAE) criterion, which emphasizes greater errors in an exponential manner and thereby induces robust performance. As opposed to the H-infinity or minimax criterion, which is often too pessimistic, NRAE allows us to select any desired degree of robustness by setting an appropriate risk-sensitivity index.

3. **Developing faithful computational models of the brain and brain-like learning machines.**

A low-order model (LOM) of biological neural networks was recently reported. LOM is a network of biologically plausible models of dendritic/axonal nodes and trees, spiking/nonspiking somas, unsupervised/supervised covariance/accumulation learning mechanisms, feedback connections with various time delays, and a scheme for maximal generalization. These component models were motivated and necessitated by the need for LOM to learn and retrieve easily, and to cluster, detect and recognize multiple/hierarchical corrupted, distorted and occluded temporal and spatial patterns.

Current and future work includes developing mechanisms for motion detection and attention selection for LOM; extending LOM to low-order models of visual, auditory, somatosensory, somatomotor, and other systems; and integrating all these models into a computational model of the brain.

In addition, LOM is ready for application to detection, clustering and recognition of such spatial patterns as handwriting, faces, targets, explosives, and weapons (in baggage and containers) and such temporal patterns as speech, text, video, and financial data.