List of papers

UNSUPERVISED LEARNING / MANIFOLD



@article{gongD-zhao-medioni-12icml-multiple-manifold-structure-learning,
 title={Robust Multiple Manifolds Structure Learning},
  author={Gong, D. and Zhao, X. and Medioni, G.},
  journal={ICML-12},
  year={2012},
  annote = {

Combines local manifold construction with merging of the resulting manifolds
based on a new curvature-level similarity measure.  A terrific idea.

??project?  is code available? 

Abstract: We present a robust multiple manifold structure learning (RMMSL)
scheme to robustly estimate data structures under the multiple low intrinsic
dimensional manifolds assumption. In the local learning stage, RMMSL
efficiently estimates local tangent space by weighted low-rank matrix
factorization. In the global learning stage, we propose a robust manifold
clustering method based on local structure learning results. The proposed
clustering method is designed to get the flattest manifolds clusters by
introducing a novel curved-level similarity function. Our approach is
evaluated and compared to state-of-the-art methods on synthetic data,
handwritten digit images, human motion capture data and motorbike videos. We
demonstrate the effectiveness of the proposed approach, which yields higher
clustering accuracy, and produces promising results for challenging tasks of
human motion segmentation and motion flow learning from videos. 

ICML site has discussion+video
http://icml.cc/discuss/2012/191.html

[pdf]

}}



@InProceedings{leQ-ranzato-ngA-12icml_high-level-features-large-scale-unsupervised,
  author =    {Quoc Le and Marc'Aurelio Ranzato and Rajat Monga and Matthieu Devin and Kai Chen and Greg Corrado and Jeff Dean and Andrew Ng},
 title =     {Building high-level features using large scale unsupervised learning},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {81--88},


---review 2013 soren goyal ~soren/cs365/hw2/index.html

The paper attempts to create and test a Neural Network Architecture which is
capable of learning high level concepts using unlabeled images. This method
is used to detect Human Faces, Human Bodies and Cats.

Training Set Construction

The training dataset was constructed by sampling frames from 10 million
YouTube videos. To avoid duplicates, only one image per video was added to
the dataset. Each example is a 200x200-pixel color image. To check the
proportion of faces in the dataset, the OpenCV face detector was run on
60x60 patches sampled at random from the dataset; fewer than 3% of 100,000
such patches contained faces.

Apart from faces, the most common objects in YouTube videos are body parts
and pets. As with faces, datasets of images containing human bodies and cats
were prepared to train the net.


--Algorithm: Architecture--

This algorithm is built upon three ideas-

Local Receptive Fields:

A small part of the image is mapped to the next layer. This allows for
equivariance to local deformations: if an object translates locally in the
image, the feature formed in the next layer moves in the same way. Data from
a number of such receptive fields is used to identify the nature of the
deformation, and "pooling", explained below, then removes it. Although local
receptive fields are used, the network is not convolutional: the parameters
are not shared across different locations in the image.

Pooling

The outputs of nearby "local receptive field" neurons are pooled together
and passed on to the next layer. Pooling typically takes the average or the
maximum of the outputs; this algorithm uses L2 pooling.
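
A minimal sketch of L2 pooling over non-overlapping groups of units (my own
illustration, not the authors' code), assuming the pooled units are arranged
contiguously:

import numpy as np

def l2_pool(activations, pool_size):
    """L2-pool a 1-D vector of unit activations in non-overlapping groups.

    Each pooled output is sqrt(sum of squares) over `pool_size` neighbouring
    units, so it responds to a feature while being less sensitive to which
    unit in the group fired."""
    a = np.asarray(activations, dtype=float)
    a = a[: len(a) - len(a) % pool_size]          # drop any ragged tail
    groups = a.reshape(-1, pool_size)
    return np.sqrt((groups ** 2).sum(axis=1))

# example: 8 unit activations pooled in groups of 4 -> 2 pooled outputs
print(l2_pool([1, 0, 2, 0, 3, 0, 0, 4], 4))       # [2.236..., 5.0]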

Local Contrast Normalization

The responses within each local neighbourhood are normalized, which makes
the learned features less sensitive to overall changes in brightness and
contrast.


This architecture can be viewed as a sparse deep autoencoder. It was
constructed by replicating three times the same stage composed of local
filtering, local pooling and local contrast normalization. The output of one
stage is the input to the next one and the overall model can be interpreted
as a nine-layered network (see the figure in the paper).

Learning and Optimization

During learning, the parameters of the second sublayers (H) are fixed to
uniform weights, whereas the encoding weights W1 and decoding weights W2 of
the first sublayers are adjusted by solving the "Reconstruction Topographic
Independent Component Analysis" optimization problem.

--Experiments and Protocols--

After training, the network was given a test set of 37,000 images containing
a mixture of faces and distractor objects.

The cat face images were collected from the dataset described in (Zhang et
al., 2008), and the human body dataset was created by subsampling at random
from a benchmark dataset (Keller et al., 2009). In both sets, negative images
were added from the ImageNet dataset.

The performance of the network was measured in terms of the classification
accuracy of the best neuron. For each neuron, the maximum and minimum
activation values were noted, the interval between them was divided into 20
levels, and the reported accuracy is the best classification accuracy over
these 20 thresholds.
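
A small sketch of this evaluation protocol (my own illustration, assuming
binary labels and one activation value per test image):

import numpy as np

def best_threshold_accuracy(activations, labels, n_levels=20):
    """Sweep n_levels thresholds between a neuron's min and max activation
    and return the best classification accuracy over the sweep."""
    a = np.asarray(activations, dtype=float)
    y = np.asarray(labels, dtype=int)      # 1 = target concept, 0 = distractor
    best = 0.0
    for t in np.linspace(a.min(), a.max(), n_levels):
        pred = (a > t).astype(int)
        # allow either polarity of the neuron
        acc = max((pred == y).mean(), ((1 - pred) == y).mean())
        best = max(best, acc)
    return best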

Results

Concept        Our       Deep autoencoders       K-means on
               network   3 layers   6 layers     40x40 images
--------------------------------------------------------------
Faces          81.7%     72.3%      70.9%        72.5%
Human bodies   76.7%     71.2%      69.8%        69.3%
Cats           74.8%     67.5%      68.3%        68.5%

Concept        Random    Best linear   Best first     Best
               weights   filter        layer neuron   neuron
--------------------------------------------------------------
Faces          67.0%     74.0%         71.0%          81.7%
Human bodies   66.5%     68.1%         67.2%          76.8%
Cats           66.0%     67.8%         67.1%          74.6%



The first two columns of the second table show results for methods that do
not require training. In the "Random weights" column, the same network
architecture is used but the weights are simply initialized at random,
without any training. The "Best linear filter" column shows the result of
the best linear filter selected from 100,000 examples sampled from the
training set.

The last column shows the result under the evaluation protocol described
above: the best neuron reaches 81.7% accuracy, far above the linear filters.
The architecture also outperforms the standard baselines on cats and human
bodies, with about 10% higher accuracy.

Object Recognition

The method's object recognition performance was a leap over the previous
best: with 20,000 categories, it achieved a 70% relative improvement over
the previous best result.

---

Abstract: We consider the challenge of building feature detectors for
high-level concepts from only unlabeled data. For example, we would like to
understand if it is possible to learn a face detector using only unlabeled
images downloaded from the Internet. To answer this question, we trained a
9-layered locally connected sparse autoencoder with pooling and local
contrast normalization on a large dataset of images (which has 10 million
images, each image has 200x200 pixels). Contrary to what appears to be a
widely-held negative belief, our experimental results reveal that it is
possible to achieve a face detector via only unlabeled data. Control
experiments show that the feature detector is robust not only to translation
but also to scaling and 3D rotation. Also via recognition and visualization,
we find that the same network is sensitive to other high-level concepts such
as cat faces and human bodies. 

discussion+video: http://icml.cc/discuss/2012/73.html


[pdf]

}}



@InProceedings{kulis-jordan-12icml_revisiting-k-means-bayesian-nonparametric,
  author =    {Brian Kulis and Michael Jordan},
 title =     {Revisiting k-means: New Algorithms via Bayesian Nonparametrics},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {513--520},

---review 2013 Souraj Misra ~sjmisra/cs365/hw2/Assignment%202.pdf

Despite the flexibility of the Bayesian framework, simpler methods such as
k-means remain the preferred choice in many large-scale applications. A major
motivation for using k-means is its simplicity and scalability: whereas
Bayesian nonparametric models require sampling algorithms or variational
inference techniques that can be difficult to implement and often do not
scale, k-means is straightforward to implement and works well for a variety
of applications.  The paper attempts to get the best of both worlds by
designing scalable hard clustering algorithms from a Bayesian nonparametric
viewpoint.

The starting point is the asymptotic connection between k-means and mixtures
of Gaussians: k-means may be viewed as a limit of the Expectation
Maximization (EM) algorithm, in the sense that if all the covariance matrices
of the clusters in the Gaussian mixture model are set to σI and σ is taken to
zero, the EM steps approach the k-means steps in the limit. In the same
spirit, the paper shows that a Gibbs sampling algorithm for the Dirichlet
process mixture approaches a hard clustering algorithm in the limit, and
further that the resulting algorithm minimizes an elegant underlying
k-means-like clustering objective that includes a penalty for the number of
clusters. This objective can be shown to be

  sum_{c=1..k} sum_{x in cluster c} ||x - mu_c||^2  +  lambda * k

This is simply the k-means objective function with an additional penalty
based on the number of clusters k. The resulting algorithm behaves just like
the k-means algorithm, except that a new cluster is created whenever a point
x is farther than λ away from every existing centroid.  The most useful
extension is to the Hierarchical Dirichlet Process (HDP). As with the hard DP
algorithm, a threshold determines when to introduce a new cluster; for the
hard HDP two parameters are needed: a "local" threshold λl and a "global"
threshold λg. It can then be shown that the following objective function is
minimized:

  sum over data sets j, sum over points x in data set j: ||x - mu_c(x)||^2
     +  λl * (total number of local clusters)  +  λg * (number of global clusters)

We minimize the global k-means objective function, but incorporate a penalty
whenever either a new local cluster or a new global cluster is created.
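
A minimal sketch of the resulting hard clustering ("DP-means") algorithm as
described above (my own illustration; initialization, convergence checking
and whether λ bounds the distance or the squared distance are assumptions on
my part):

import numpy as np

def dp_means(X, lam, n_iter=50):
    """k-means-like clustering with a cluster penalty lam: a point whose
    squared distance to every existing centroid exceeds lam starts a new
    cluster.  X is an (n_points, n_dims) array."""
    centroids = [X.mean(axis=0)]                  # start with one global cluster
    for _ in range(n_iter):
        assign = []
        for x in X:
            d2 = [np.sum((x - c) ** 2) for c in centroids]
            if min(d2) > lam:                     # too far from all centroids
                centroids.append(x.copy())        # open a new cluster
                assign.append(len(centroids) - 1)
            else:
                assign.append(int(np.argmin(d2)))
        assign = np.array(assign)
        centroids = [X[assign == k].mean(axis=0)
                     for k in range(len(centroids)) if np.any(assign == k)]
    # final assignment against the surviving centroids
    assign = np.array([int(np.argmin([np.sum((x - c) ** 2) for c in centroids]))
                       for x in X])
    return np.array(centroids), assign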

---

Abstract: Bayesian models offer great flexibility for clustering
applications — Bayesian nonparametrics can be used for modeling infinite
mixtures, and hierarchical Bayesian models can be utilized for shared
clusters across multiple data sets. For the most part, such flexibility is
lacking in classical clustering methods such as k-means. In this paper, we
revisit the k-means clustering algorithm from a Bayesian nonparametric
viewpoint. Inspired by the asymptotic connection between k-means and mixtures
of Gaussians, we show that a Gibbs sampling algorithm for the Dirichlet
process mixture approaches a hard clustering algorithm in the limit, and
further that the resulting algorithm monotonically minimizes an elegant
underlying k-means-like clustering objective that includes a penalty for the
number of clusters. We generalize this analysis to the case of clustering
multiple data sets through a similar asymptotic argument with the
hierarchical Dirichlet process. We also discuss further extensions that
highlight the benefits of our analysis: i) a spectral relaxation involving
thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm
that does not fix the number of clusters in the graph. 

discussion+video at ICML papers site

[pdf]

}}



@inproceedings{chua-marsland-guesgen-11ijc_unsupervised-data-streams-patterns-w-edit-distance,
 title={Unsupervised learning of patterns in data streams using compression and edit distance},
  author={Sook-Ling Chua and Stephen Marsland and Hans W. Guesgen},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1231--1236},
  year={2011},
  abstract = {

patterns = repeated data, but may have noise. 

idea: use compression algorithms -> identifies redundancies. 

patterns may not match exactly -> edit distance between patterns

--> key idea:  Compression in the presence of variations

from : ####
[Cilibrasi and Vitányi, 2005] Rudi Cilibrasi and Paul M. B. Vitányi.
Clustering by compression. IEEE Transactions on Information
Theory, 51(4):1523–1545, 2005.


---review 2013 priyank jaini ~pjaini/cs365/hw2/Paper_Report.pdf

Objective: Many methods are available for recognising patterns in data
streams based on fixed-length sequences, but little is available for
variable-length sequences. In this paper the authors suggest an unsupervised
learning method that handles variable-length data sequences by identifying
structure in the data stream using data compression and edit distance. The
method both segments the data stream into suitable patterns and then
identifies each pattern whenever it is encountered. It is tested on both
fixed- and variable-length datasets.

Methodology: The data points are called tokens, so a pattern is a repeated
set of tokens. A dictionary (code book) of tokens is created using the LZW
algorithm (a lossless data compression algorithm). The dictionary created
this way would be huge, since it contains all kinds of tokens, but only
frequently occurring tokens are of interest: the longest frequent words in
the dictionary are needed to identify patterns. The dictionary is therefore
reduced using lossy compression so that it contains only prototype words.
This also serves to treat minor variations of a pattern as equivalent, which
is enforced using the edit distance, a measure of similarity between a pair
of strings. Using this measure, the dictionary is quantised.

Let the data stream be S = {d1, d2, ..., dk} and the quantised set of words
be D = {w1, w2, ..., wn}. The edit distance between each wr and the data
stream S is calculated and stored in a 2-D matrix. A threshold ε is chosen to
set how much variation is allowed. If there is a perfect match (edit distance
0), the number of steps is found by tracing backwards through the matrix.
When there is an error, the starting point of the word boundary can be found
by minimising min{dist[i, j-1], dist[i-1, j-1], dist[i-1, j]}, i.e.
traversing backward and up.
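
For reference, the classic edit-distance (Levenshtein) dynamic program that
the traversal above is based on (my own sketch, not the authors' code):

def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions needed to
    turn sequence a into sequence b (classic dynamic programming)."""
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j] + 1,          # deletion
                             dist[i - 1][j - 1] + cost)   # match / substitution
    return dist[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))   # 3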

This method was tested on both variable- and fixed-length datasets and the
accuracy of the results was comparable.


---
ABSTRACT

Many unsupervised learning methods for recognising patterns in data streams
are based on fixed length data sequences, which makes them unsuitable for
applications where the data sequences are of variable length such as in
speech recognition, behaviour recognition and text classification. In order
to use these methods on variable length data sequences, a pre-processing step
is required to manually segment the data and select the appropriate features,
which is often not practical in real-world applications. In this paper we
suggest an unsupervised learning method that handles variable length data
sequences by identifying structure in the data stream using text compression
and the edit distance between ‘words’. We demonstrate that using this method
we can automatically cluster unlabelled data in a data stream and perform
segmentation. We evaluate the effectiveness of our proposed method using both
fixed length and variable length benchmark datasets, comparing it to the
Self-Organising Map in the first case. The results show a promising
improvement over baseline recognition systems.

[pdf]

}}



@inproceedings{ackerman-benDavid-11_how-to-select-hierarchical-clustering-algo,
 title={Discerning linkage-based algorithms among hierarchical clustering methods},
  author={Margareta Ackerman and Shai Ben-David},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1140--1145},
  year={2011},
  annote = {

--Discerning Linkage-Based Algorithm Among Hierarchical Clustering Methods--

The paper notes that the study of properties of clustering algorithms has
mainly focused on partitional algorithms, which produce a single partition of
the data. The authors introduce two properties, locality and outer
consistency, to give a property-based characterization of hierarchical
linkage-based algorithms. They give a formal definition of a dendrogram over
(X, d), where X is the dataset and d is the distance function, as a triple
(T, M, η) such that for every leaf x ∈ V(T), η(x) = 0, and if (x, y) ∈ E(T),
then η(x) > η(y). They further define domain isomorphisms, cluster
isomorphisms and tree isomorphisms, and use these to define isomorphic
dendrograms, as well as hierarchical clustering functions, linkage functions
and linkage-based functions.

The major contribution of the paper is the introduction of two new
properties, locality and outer consistency. The authors informally write
that "locality states that if a clustering is selected from a dendrogram and
the hierarchical algorithm is run on the data underlying this clustering,
the result obtained is consistent with the original dendrogram". In general,
if for a given subset of clusters of the dataset we move points belonging to
different clusters even further apart, by choosing an outer-consistent
distance function, and then run a hierarchical algorithm on this dataset,
then the given clustering should also be present in the resulting
dendrogram. Formally, this notion is expressed in terms of an A-cut, where A
is a cluster in the dendrogram.

The authors go on to prove three non-trivial results characterising
linkage-based hierarchical algorithms and comparing them with divisive
algorithms, stated as follows.

Theorem 1. A hierarchical function F is linkage-based if and only if F is
outer-consistent and local.

   In order to prove this, the authors introduce a pseudo-partial ordering

[pdf]

}}



@inproceedings{lawrence-11_spectral-dimensionality-reduction-via-maximum-entropy-NLDR-MDS,
 title={Spectral dimensionality reduction via maximum entropy},
  author={Lawrence, N.D.},
  booktitle={Proceedings of the international conference on artificial intelligence and statistics},
  pages={51--59},
  year={2011}

Abstract

We introduce a new perspective on spectral dimensionality reduction which
views these methods as Gaussian random fields (GRFs). Our unifying
perspective is based on the maximum entropy principle which is in turn
inspired by maximum variance unfolding.  The resulting probabilistic models
are based on GRFs. The resulting model is a nonlinear generalization of
principal component analysis. We show that parameter fitting in the locally
linear embedding is approximate maximum likelihood in these models.  We
directly maximize the likelihood and show results that are competitive with
the leading spectral approaches on a robot navigation visualization and a
human motion capture data set.

1 Introduction

Spectral approaches to dimensionality reduction involve taking a data set
containing n points and forming a matrix of size n x n from which
eigenvectors are extracted to give a representation of the data in a low
dimensional space. Several spectral methods have become popular in the
machine learning community including isomap [Tenenbaum et al., 2000], locally
linear embeddings [LLE, Roweis and Saul, 2000], Laplacian eigenmaps [LE,
Belkin and Niyogi, 2003] and maximum variance unfolding [MVU, Weinberger et
al., 2004]. These approaches [and kernel PCA, Scholkopf et al., 1998] are
closely related to classical multidimensional scaling [CMDS, Mardia et al.,
1979].
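
A bare-bones sketch of the generic spectral recipe described above, using
classical MDS on a distance matrix (my own illustration, not the paper's
GRF-based model):

import numpy as np

def classical_mds(D, d_out=2):
    """Embed n points in d_out dimensions from an n x n matrix of pairwise
    distances D, by double-centering and taking the top eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of centred points
    w, V = np.linalg.eigh(B)                     # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:d_out]            # keep the largest d_out
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))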


[pdf]

}}



@inproceedings{netzer-ng-11_digits-in-images-unsupervised-feature-learning,
 title={Reading digits in natural images with unsupervised feature learning},
  author={Netzer, Y. and Wang, T. and Coates, A. and Bissacco, A. and Wu, B. and Ng, A.Y.},
  booktitle={NIPS Workshop on Deep Learning and Unsupervised Feature Learning},
  volume={2011},
  year={2011},
  abstract = {

---review 2013 vishal kumawat

INTRODUCTION:- The aim of the paper is to read digits in natural images
using unsupervised feature learning. Reading text in images is difficult
because of varying fonts, lighting and distortion, and because the text must
be separated from a cluttered natural background. The paper focuses on
reading house numbers photographed in Street View images (the SVHN dataset),
examines how well feature learning algorithms work in this scenario, and
shows why traditional hand-crafted features are not sufficient.

SVHN DATA SET:- There are two stages in reading a house number: a detection
stage that locates the house number in the image, and a recognition stage
that searches over possible character locations in the detected house
number, classifying each candidate frame as one of ten digits (0 through 9).
The SVHN data are taken from a large set of Street View images and consist
of about 600,000 labelled characters, available in two formats: FULL NUMBERS
(original, variable-resolution transcriptions of digit sequences) and
CROPPED DIGITS (all digits resized to 32x32 in a way intended to avoid
aspect-ratio distortion). The dataset is divided into a train set (about
70,000 images), a test set (about 26,000 images) and an extra train set
(about 531,000 images). Because of large intra-class variation and complex
photometric distortions, recognition is challenging.

MODELS:- A main thrust of the investigation is to determine how features
generated by feature learning systems compare to the hand-constructed
feature representations commonly used in other computer vision systems.

Among the HAND-CRAFTED FEATURES, the most popular are HOG features and an
off-the-shelf cocktail of binary image features. For the LEARNED FEATURES
there are two main algorithms: a stacked sparse auto-encoder and a K-means
based system. A major drawback of many feature learning systems is their
complexity: these algorithms usually require careful selection of multiple
hyperparameters such as learning rates, momentum, sparsity penalties and
weight decay, which must be chosen through cross-validation.
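
A rough sketch of a K-means feature pipeline in the spirit of the systems the
paper compares (patch size, whitening and the soft "triangle" encoding are my
assumptions, not the paper's exact settings):

import numpy as np
from sklearn.cluster import KMeans

def learn_dictionary(patches, n_centroids=500):
    """Cluster (preprocessed) image patches; the centroids then act as a
    feature dictionary learned without labels."""
    km = KMeans(n_clusters=n_centroids, n_init=5, random_state=0)
    km.fit(patches)                       # patches: (n_patches, patch_dim)
    return km.cluster_centers_

def encode(patches, centroids):
    """Soft-threshold ("triangle") encoding: f_k = max(0, mean_dist - dist_k)."""
    d = np.linalg.norm(patches[:, None, :] - centroids[None, :, :], axis=2)
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)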

EXPERIMENTAL RESULTS:- The K-means based system has the best accuracy
(about 90%), with the stacked sparse auto-encoder close behind at 89.7%.
The hand-crafted features do worse: HOG reaches 85% and the binary features
63.3%. Human performance is about 98%, and varies with the height of the
digit in pixels.

APPLICATIONS:- Improving map services, and automatic detection and
recognition of house numbers in Street View images.

CONCLUSION:- The paper applies unsupervised feature learning successfully to
identify digits in Street View images, and shows that the learned features
perform better than hand-crafted features, though there is still room for
improvement before reaching human performance.

---

Detecting and reading text from natural images is a hard computer vision task
that is central to a variety of emerging applications. Related problems like
document character recognition have been widely studied by computer vision
and machine learning researchers and are virtually solved for practical
applications like reading handwritten digits. Reliably recognizing characters
in more complex scenes like photographs, however, is far more difficult: the
best existing methods lag well behind human performance on the same tasks. In
this paper we attack the problem of recognizing digits in a real application
using unsupervised feature learning methods: reading house numbers from
street level photos. To this end, we introduce a new benchmark dataset for
research use containing over 600,000 labeled digits cropped from Street View
images. We then demonstrate the difficulty of recognizing these digits when
the problem is approached with hand-designed features. Finally, we employ
variants of two recently proposed unsupervised feature learning methods and
find that they are convincingly superior on our benchmarks.


[pdf]

}}



@inproceedings{ciresan-meier-11ijc_convolutional-NN-for-image-classification,
 title={Flexible, high performance convolutional neural networks for image classification},
  author={Dan C. Cire{\c{s}}an and Ueli Meier and Jonathan Masci and Luca Maria Gambardella and Jürgen Schmidhuber},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1237--1242},
  year={2011},
  abstract = {

Among the strongest classifiers for MNIST and other domains.

abstract:
We present a fast, fully parameterizable GPU implementation of Convolutional
Neural Network variants. Our feature extractors are neither carefully
designed nor pre-wired, but rather learned in a supervised way. Our deep
hierarchical architectures achieve the best published results on benchmarks
for object classification (NORB, CIFAR10) and handwritten digit recognition
(MNIST), with error rates of 2.53%, 19.51%, 0.35%, respectively. Deep nets
trained by simple back-propagation perform better than more shallow
ones. Learning is surprisingly rapid. NORB is completely trained within five
epochs. Test error rates on MNIST drop to 2.42%, 0.97% and 0.48% after 1, 3
and 17 epochs, respectively.

[pdf]

}}



@article{nair-hinton-09_3d-object-recog-deep,
  title={3D object recognition with deep belief nets},
  author={Vinod Nair and Geoffrey E. Hinton},
  journal={Advances in Neural Information Processing Systems},
  volume={22},
  pages={1339--1347},
  year={2009},
  abstract= {

3D object database : NORB

---review 2013 Vishal K Gupta

Overview:

Recent advancement in DBNs have showed that it is possible to learn multiple
layers of non-linear features without actually using labelled data. Object
recognition is one of the prominent challenges of artificial intelligence
these days and and DBNs on account of their unsupervised feature learning
have contributed a lot in this field. Greedy layerwise training is performed
i.e training is done for one layer at a time as "Restricted Boltzman Machine"
using "Contrastive Divergence". Fine tuning of the net is done using
supervised backpropagation after pretraining.

=> Restricted Boltzmann Machine: a variant of the Boltzmann Machine, with
   the restriction that each connection must be between a hidden unit and a
   visible unit, unlike a recurrent network (which also has connections
   among hidden units). It is a generative model that tries to learn a
   probability distribution over its set of inputs [1][2].

=> Contrastive Divergence: a reasonably fast method for training RBMs. We
   propose a probability distribution and try to fit the input data to it;
   in principle this requires many cycles of Markov Chain Monte Carlo (MCMC)
   sampling, which is time consuming, so instead only a few cycles of
   sampling are used to compute an approximate gradient rather than an
   accurate one. The intuition is that after a few iterations the data will
   have moved from the target distribution towards the proposed
   distribution [1][2][3].

Theory Explained:

Until now, DBNs have had only two types of observed units at the top level:
one for the label of the object and one for the penultimate feature vector,
irrespective of what happens in the hidden layer. This paper presents a
third-order RBM as the top-level model, in which a joint distribution over
the visible units, the label and the hidden units is modelled. (The paper's
figure contrasts the earlier approach and the present one.)


In this way the parameters form a 3D tensor instead of a matrix as in the
earlier case.

An energy function is defined for the model; ignoring bias terms it has the
form

  E(v, h, l) = - sum_{i,j,k} W_ijk v_i h_j l_k

where W is the 3D parameter tensor. The probability of a full configuration
is then

  P(v, h, l) = exp(-E(v, h, l)) / Z

where Z represents the partition function.

Defining the label vector as a discrete random variable with K states in a
1-of-K encoding, if the k-th component of the label is 1 then the k-th slice
of the tensor defines the energy function.

So the distributions being modelled are P(l|v) for classification, and
P(v,l|h) and P(h|v,l) for learning based on Contrastive Divergence. Once l is
observed, the model reduces to an RBM whose parameters are the k-th slice of
the 3D tensor. The whole computation takes O(Nv Nh Nl).

=>Learning:

Contrastive Divergence learning is basically formulated for the bipartite
architecture of an RBM, since maximum likelihood learning is intractable. In
this paper CD learning is extended to the third-order RBM as well. The
maximum likelihood gradient would be computed by running an MCMC sampling
chain until it reaches equilibrium; CD learning instead uses only three
half-steps of this chain, initialized from a training example rather than
from a random state.

==> Algorithm:

1) For a given training pair {v+, l+} with l_k+ = 1, sample h+ ~ P(h | v+, l_k+ = 1).

2) Compute the outer product D+ = v+ (h+)^T.

3) Sample {v-, l-} ~ P(v, l | h+). Let l_m- = 1.

4) Sample h- ~ P(h | v-, l_m- = 1).

5) Compute the outer product D- = v- (h-)^T.

This yields an update for the matrix W_{.,.,k} (of size Nh x Nv) corresponding
to the k-th slice of the 3D tensor:

  delta W_{.,.,k} = D+ - D-

  W_{.,.,k} (new) = W_{.,.,k} (old) + n * delta W_{.,.,k},  where n is the learning rate.
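
For orientation, a minimal CD-1 weight update for an ordinary binary RBM
(not the third-order, label-conditioned model of the paper); this is my own
sketch, with biases omitted:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v_data, lr=0.01, seed=0):
    """One contrastive-divergence step for a binary RBM with weight matrix W
    of shape (n_visible, n_hidden); v_data is a (batch, n_visible) array."""
    rng = np.random.default_rng(seed)
    # positive phase: sample hidden units given the data
    p_h = sigmoid(v_data @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # negative phase: one reconstruction step
    p_v = sigmoid(h @ W.T)
    v_model = (rng.random(p_v.shape) < p_v).astype(float)
    p_h_model = sigmoid(v_model @ W)
    # approximate gradient: <v h>_data - <v h>_model
    grad = v_data.T @ p_h - v_model.T @ p_h_model
    return W + lr * grad / v_data.shape[0]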

Another way of updating the parameters is through a discriminative update
rule: maximum likelihood learning of P(l|v).

1) Compute log P(l_c = 1 | v+).

2) The hybrid learning rule combines this discriminative gradient with the
generative CD update above (the exact combination is given in the paper).

The two main advantages of this hybrid learning are:
1) It avoids the slow mixing of CD learning of P(v,l).

2) It allows learning from both labelled and unlabelled data.

Put another way, it optimizes conditional distributions in place of the
joint distribution, much like optimizing a pseudo-likelihood. To improve
discriminative performance, binary features that are only rarely active are
used. The error measure compares the cross-entropy between the desired and
actual distributions.

Evaluation and Validation of the model:

The NORB database consists of five object categories: animals, humans,
planes, trucks and cars. The normalized version of the database is used;
each training example is a stereo pair of 96x96 gray-scale images. Based on
the algorithm explained above, different models are built, both shallow and
deep. As expected, the deep models perform better: introducing even one
greedily pre-trained layer decreases the error rate considerably compared to
the shallow model. The third-order RBM outperforms a standard RBM top-level
model with the same number of hidden units. Comparing the "hybrid learning
rule" with plain "CD" learning for the top-level RBM, there is not much
difference between the two error rates, but the hybrid rule is better.

Conclusion:

The two main points to conclude are:

1) Using a generative model for object recognition gives much better results,
because it produces a representation of the input image that is more
accurate than raw pixels.

2) Including P(v|l) in the learning rule along with P(l|v) when training the
top-level model improves the performance on the classification task.

References:

[1] Wikipedia
[2] www.deeplearning.net
[3] http://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf

[4] Images are taken from the review paper itself.

---

We introduce a new type of top-level model for Deep Belief Nets and evaluate
it on a 3D object recognition task. The top-level model is a third-order
Boltzmann machine, trained using a hybrid algorithm that combines both
generative and discriminative gradients. Performance is evaluated on the
NORB database (normalized-uniform version), which contains stereo-pair
images of objects under different lighting conditions and viewpoints. Our
model achieves 6.5% error on the test set, which is close to the best published
result for NORB (5.9%) using a convolutional neural net that has
built-in knowledge of translation invariance. It substantially outperforms
shallow models such as SVMs (11.6%). DBNs are especially suited for
semi-supervised learning, and to demonstrate this we consider a modified
version of the NORB recognition task in which additional unlabeled images
are created by applying small translations to the images in the database.
With the extra unlabeled data (and the same amount of labeled data as
before), our model achieves 5.2% error.

[pdf]

}}



@article{yangJ-huangT-10eccv_efficient-sparse-coding-mixture-model,
 title={Efficient highly over-complete sparse coding using a mixture model},
  author={Yang, J. and Yu, K. and Huang, T.},
  journal={Computer Vision--ECCV 2010},
  pages={113--126},
  year={2010},
  publisher={Springer}

Abstract. Sparse coding of sensory data has recently attracted notable
attention in research of learning useful features from the unlabeled data.
Empirical studies show that mapping the data into a significantly higher-
dimensional space with sparse coding can lead to superior classification
performance. However, computationally it is challenging to learn a set of
highly over-complete dictionary bases and to encode the test data with
the learned bases. In this paper, we describe a mixture sparse coding
model that can produce high-dimensional sparse representations very
efficiently. Besides the computational advantage, the model effectively
encourages data that are similar to each other to enjoy similar sparse
representations. What's more, the proposed model can be regarded as an
approximation to the recently proposed local coordinate coding (LCC),
which states that sparse coding can approximately learn the nonlinear
manifold of the sensory data in a locally linear manner. Therefore, the
feature learned by the mixture sparse coding model works pretty well
with linear classifiers. We apply the proposed model to PASCAL VOC
2007 and 2009 datasets for the classification task, both achieving
state-of-the-art performances.

[pdf]
}}





@InProceedings{tangY-salakhutdinov-hinton-12icml_deep-mixtures-factor-analysers,
  author =    {Yichuan Tang and Ruslan Salakhutdinov and Geoffrey Hinton},
 title =     {Deep Mixtures of Factor Analysers},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {505--512},

Reviews standard ideas in deep learning and proposes a mechanism for greedy
layer-wise learning of weights, analogous to that used for Restricted
Boltzmann Machines (RBMs), but applied to Mixtures of Factor Analysers.


--13 365 Ankit Agarwal DeepMixtures ofFactor Analysers.htm--

In the experiments they first trained an MFA model on 30,000 24x24 face
images from the Toronto Face Database (TFD), with 288 factors and 100
components. Then a DMFA was trained with 20 first-layer components and 5
second-layer components for each of the 20 first-layer components. The DMFA
has the same number of parameters as the baseline MFA. Qualitatively, the
DMFA appears to generate better samples than the shallow MFA model.

As density models, MFAs significantly outperform undirected RBM models for
real-valued data and by using second layer MFAs to model the aggregated
posterior of each first layer factor analyser, we can achieve substantial
gains in performance. Higher input dimensionality leads to bigger gains from
learning DMFAs.  However, adding a third MFA layer appears to be of little
value. Another possible extension is to train a mixture of linear dynamical
systems and then to train a higher-level mixture of linear dynamical systems
to model the aggregated posterior of each component of the first level
mixture.

Abstract: An efficient way to learn deep density models that have many layers
of latent variables is to learn one layer at a time using a model that has
only one layer of latent variables. After learning each layer, samples from
the posterior distributions for that layer are used as training data for
learning the next layer. This approach is commonly used with Restricted
Boltzmann Machines, which are undirected graphical models with a single
hidden layer, but it can also be used with Mixtures of Factor Analysers
(MFAs) which are directed graphical models. In this paper, we present a
greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers
(DMFAs). Even though a DMFA can be converted to an equivalent shallow MFA by
multiplying together the factor loading matrices at different levels,
learning and inference are much more efficient in a DMFA and the sharing of
each lower-level factor loading matrix by many different higher level MFAs
prevents overfitting. We demonstrate empirically that DMFAs learn better
density models than both MFAs and two types of Restricted Boltzmann Machines
on a wide variety of datasets. 

discussion+video: http://icml.cc/discuss/2012/284.html

[pdf]

}}




@InProceedings{varoquaux-gramfort-12icml_small-sample-fmri-spatial-clustering,
  author =    {Gael Varoquaux and Alexandre Gramfort and Bertrand Thirion},
 title =     {Small-sample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {1375--1382},
  annote = {


Abstract: Functional neuroimaging can measure the brain’s response to an
external stimulus. It is used to perform brain mapping: identifying from
these observations the brain regions involved. This problem can be cast into
a linear supervised learning task where the neuroimaging data are used as
predictors for the stimulus. Brain mapping is then seen as a support recovery
problem. On functional MRI (fMRI) data, this problem is particularly
challenging as i) the number of samples is small due to limited acquisition
time and ii) the variables are strongly correlated. We propose to overcome
these difficulties using sparse regression models over new variables obtained
by clustering of the original variables. The use of randomization techniques,
e.g. bootstrap samples, and hierarchical clustering of the variables improves
the recovery properties of sparse methods. We demonstrate the benefit of our
approach on an extensive simulation study as well as two publicly available
fMRI datasets. 

discussion+video: http://icml.cc/discuss/2012/688.html


[pdf]

}}



@InProceedings{chen-carson-12icml_information-theoretic-linear-discriminant-analysis,
  author =    {Minhua Chen and William Carson and Miguel Rodrigues and Robert Calderbank and Lawrence Carin},
 title =     {Communications Inspired Linear Discriminant Analysis},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {919--926},
  annote = {

dimensionality reduction based classification. 

--review 2013 AYUSH JAIN --

Analysis of high-dimensional data is a challenge due to the high
computational costs involved.  High-dimensional data arises in many settings,
in particular image segmentation and classification, and dimensionality
reduction is generally applied to such data (image pixels etc.) for
computational and other reasons.  The focus of the paper is the problem of
dimensionality reduction using information-theoretic techniques: supervised
dimensionality reduction is done by designing a projection matrix such that
the mutual information between the projection and the corresponding class
label is maximised, so that class labels can be predicted correctly with
high probability.


In earlier work, quadratic (Renyi) mutual information was used instead of
Shannon mutual information because it is easier to compute than the Shannon
entropy and the gradient of the MI can be calculated analytically.  The
contribution of this paper is to show that the Shannon MI objective function
for linear feature design in classification can be optimized directly,
without compromising or simplifying the objective function.

Algorithm:

We assume that label and data are generated i.i.d. via the following process:
 c ~ Mult(c; 1, w);  x|c ~ p(x|c), where w is the prior distribution on the M
 classes, x is the original signal, and p(x|c) is the data distribution for
 class c. Hence the joint density is p(x, c) = w_c p(x|c). In supervised
 dimensionality reduction, we seek a projection matrix Psi such that the
 projected signal is y = Psi x + epsilon.

1. Obtain the input signal distribution in (1) from training data. Initialize
	Psi.
2. Compute the equivalent MMSE matrix in (7) via Monte Carlo simulation.
3. Compute the gradient in (6) and update the projection matrix as
	Psi <- orth(Psi + K Grad(I(C;Y))), where K is the step size and
	orth(A) means projecting A onto an orthonormal matrix.
4. If converged, stop. Otherwise, go to step 2.
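
A toy sketch of this loop (my own illustration only: the paper computes the
MI gradient analytically via an MMSE matrix, whereas this sketch substitutes
a crude plug-in Gaussian MI estimate and finite differences, and all names
and defaults are mine):

import numpy as np
from scipy.stats import multivariate_normal

def orth(A):
    """Project the rows of A onto an orthonormal set (via QR)."""
    Q, _ = np.linalg.qr(A.T)
    return Q.T

def mi_plugin(Psi, X, c, n_classes):
    """Crude Monte Carlo estimate of I(C; Y) for Y = Psi x, modelling each
    class as a Gaussian in the projected space."""
    Y = X @ Psi.T
    priors, logp = [], []
    for k in range(n_classes):
        Yk = Y[c == k]
        priors.append(len(Yk) / len(Y))
        cov = np.atleast_2d(np.cov(Yk.T)) + 1e-3 * np.eye(Y.shape[1])
        logp.append(multivariate_normal(Yk.mean(axis=0), cov).logpdf(Y))
    logp = np.stack(logp, axis=1)                       # log p(y | c)
    log_py = np.log(np.exp(logp) @ np.array(priors) + 1e-300)
    return float(np.mean(logp[np.arange(len(Y)), c] - log_py))

def fit_projection(X, c, d_out, n_classes, steps=30, lr=0.1, eps=1e-3):
    """Gradient ascent on the MI estimate with re-orthonormalization after
    every step (finite differences stand in for the analytic gradient)."""
    rng = np.random.default_rng(0)
    Psi = orth(rng.standard_normal((d_out, X.shape[1])))
    for _ in range(steps):
        base = mi_plugin(Psi, X, c, n_classes)
        G = np.zeros_like(Psi)
        for i in range(Psi.shape[0]):
            for j in range(Psi.shape[1]):
                P = Psi.copy()
                P[i, j] += eps
                G[i, j] = (mi_plugin(P, X, c, n_classes) - base) / eps
        Psi = orth(Psi + lr * G)
    return Psi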

Tests were performed on three real datasets: Satellite, Letter and USPS.
Four dimensionality reduction methods are considered: LDA, IDA, the quadratic
Renyi entropy based method with objective function I2(C;Y), and the proposed
Shannon entropy based method.  The performance of the proposed method is very
promising: it either gives the best performance or is very near to the best.

This paper successfully applies concepts from communication theory to machine
learning to produce promising outcomes, and shows how a research product from
one area (communications) can benefit research in a seemingly different area
(machine learning).

---

Abstract: We study the problem of supervised linear dimensionality reduction,
taking an information-theoretic viewpoint. The linear projection matrix is
designed by maximizing the mutual information between the projected signal
and the class label (based on a Shannon entropy measure). By harnessing a
recent theoretical result on the gradient of mutual information, the above
optimization problem can be solved directly using gradient descent, without
requiring simplification of the objective function. Theoretical analysis and
empirical comparison are made between the proposed method and two closely
related methods (Linear Discriminant Analysis and Information Discriminant
Analysis), and comparisons are also made with a method in which Renyi entropy
is used to define the mutual information (in this case the gradient may be
computed simply, under a special parameter setting). Relative to these
alternative approaches, the proposed method achieves promising results on
real datasets. 

video on ICML papers site


[pdf]

}}



@InProceedings{boots-gordon-12icml_two-manifold-merging-from-separate-views,
  author =    {Byron Boots and Geoff Gordon},
 title =     {Two-Manifold Problems with Applications to Nonlinear System Identification},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {623--630},
  url =       {http://arxiv.org/abs/1206.4648},
  annote = {

Abstract: Recently, there has been much interest in spectral approaches to
learning manifolds—so-called kernel eigenmap methods. These methods have had
some successes, but their applicability is limited because they are not
robust to noise. To address this limitation, we look at two-manifold
problems, in which we simultaneously reconstruct two related manifolds, each
representing a different view of the same data. By solving these
interconnected learning problems together, two-manifold algorithms are able
to succeed where a non-integrated approach would fail: each view allows us to
suppress noise in the other, reducing bias. We propose a class of algorithms
for two-manifold problems, based on spectral decomposition of
cross-covariance operators in Hilbert space and discuss when two-manifold
problems are useful. Finally, we demonstrate that solving a two-manifold
problem can aid in learning a nonlinear dynamical system from limited data.

discussion+video on ICML site. 
http://icml.cc/discuss/2012/338.html


[pdf]

}}

SUPERVISED ACTION / SEQUENCE CLASSIFICATION




@inproceedings{kaiser-12_learning-game-rules-from-videos,
 title={Learning Games from Videos Guided by Descriptive Complexity},
  author={Kaiser, {\L}.},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012},
  annote = {

information theoretic approach for learning game rules by watching
videos of the games.

--review: Akshay Agrawal (10058)--

==OBJECTIVE: To learn board games (two-player games only, such as
Tic-Tac-Toe, Connect4, Pawns, Breakthrough) from a given set of videos.

The set of videos includes cases where player 1 wins, player 2 wins, an
illegal move is made, or nobody wins.
(The number of training videos of each kind varies in proportion to the
number of permutations of each kind possible.)

==APPROACH SUGGESTED:
(All the Descriptions made are with respect to a 3*3 grid, Descriptions may
vary with the choice of board  structure (e.g. triangle, grid etc.)) 

--Collection of data from observation into modifiable data structures:

  Data structure used - Relational Structure
  Description of 'Relational Predicates' used:
  R(x,y) => (&y == &x+3) : 'TRUE' if y follows x along a row, 'FALSE' otherwise

  C(x,y) => (&y == &x+1) : 'TRUE' if y lies above x in a column, 'FALSE'
	 otherwise {This definition is only descriptive, lacks an additional
	 factor} 
  diag(x,y) => 'TRUE' if for some 'u' (R(x,u) && C(u,y)), 'FALSE' otherwise
	 {i.e. 'u' lies right of 'x' and 'y'  above 'u'} 
  lastRow(x) => 'TRUE' if for no 'u' C(x,u), 'FALSE' otherwise {i.e. no 'u'
	 exists below 'x'} 

All these predicates, and possibly more, corresponding to a given state of
the game are filled in from the image analysis results.

--Designing appropriate logic formulas ("RULES") 
as per the conditions for win or loss or tie learnt:

  Considering each video as a sequence of structures, all possible pairs are
  classified under appropriate  condition, depending on the label of the
  video being learnt (1Win, 2Win, Illegal, No Win etc.) 

   - Learning Winning Condition: e.g. For some x1 (W(x1) ^ For no x0: C(x1, x0))
	means "White wins if there is at least one white piece x1 with no
	square x0 above it", i.e. a white piece has reached the last row.
   - Learning Legal Moves/ Illegal moves

--Game Playing Algorithm:

Algorithm to decide the next move based on the expectation of the player by
analysing the possible states after each move and selecting the most adequate
ones from the set of interest.

Technically, this is done by treating the current structure/universe as a
node of a UCT tree, with all possible legal moves from this state as child
nodes of that node.

Then the probabilities of win/loss/tie are estimated along each path of the
tree, the path with the highest P(win) is adopted, and the corresponding move
is made.
Drawback: this can prove very slow for a larger game structure with more
relational structures defining the states.
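
For reference, the standard UCB1 selection rule that a UCT search of this
kind typically uses (my own sketch; the system's actual implementation is in
the Toss sources linked below):

import math

def uct_select(children, total_visits, c=1.4):
    """Pick the child maximizing average reward plus an exploration bonus
    that shrinks as the child is visited more often (UCB1)."""
    def score(child):
        if child["visits"] == 0:
            return float("inf")              # try unvisited moves first
        exploit = child["wins"] / child["visits"]
        explore = c * math.sqrt(math.log(total_visits) / child["visits"])
        return exploit + explore
    return max(children, key=score)

# children are dicts such as {"move": ..., "wins": 3, "visits": 7}
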
Major Contribution:
- Following equivalences are used: Games as mathematical structures and moves as logical rules.
- Resizable number of 'Relational Predicates' can be used.
- Logic is used for rules definition (Second-order logic with additional real arithmetic has been used).

References:
- http://toss.sourceforge.net/eval.html
- http://toss.sourceforge.net/create.html

abstract:
In recent years, several systems have been proposed that learn the rules of a
simple card or board game solely from visual demonstration. These systems
were constructed for specific games and rely on substantial background
knowledge.  We introduce a general system for learning board game rules from
videos and demonstrate it on several well-known games.  The presented
algorithm requires only a few demonstrations and minimal background
knowledge, and, having learned the rules, automatically derives position
evaluation functions and can play the learned games competitively. Our main
technique is based on descriptive complexity, i.e. the logical means
necessary to define a set of interest. We compute formulas defining allowed
moves and final positions in a game in different logics and select the most
adequate ones. We show that this method is well-suited for board games and
there is strong theoretical evidence that it will generalize to other
problems.


[pdf]

}}



@inproceedings{blasiak-rangwala-11ijc_hmm-variant-for-sequence-classifiction,
 title={A hidden markov model variant for sequence classification},
  author={Sam Blasiak and Huzefa Rangwala},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1192--1197},
  year={2011},
  organization={AAAI Press},
  abstract = {

Sequence learning can be used to learn protein structures or actions.
Sequences are often of unequal length and hard to compare.  Here, a variant
of the HMM is developed that uses a single emission matrix shared across the
data and a separate transition matrix for each data sequence.  Since the
transition matrices all have the same dimensions, they can be used as
fixed-length representations for classifying the sequences, e.g. with an SVM.
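
A much-simplified sketch of that idea (my own illustration, using plain
empirical transition counts per sequence rather than the paper's HMM
variant), just to show how fixed-length descriptors can feed a standard
classifier:

import numpy as np
from sklearn.svm import SVC

def transition_features(seq, n_symbols):
    """Row-normalized empirical transition matrix of a discrete sequence,
    flattened into a fixed-length feature vector."""
    counts = np.zeros((n_symbols, n_symbols))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
    return probs.ravel()

def classify_sequences(train_seqs, train_labels, test_seqs, n_symbols):
    X_train = np.array([transition_features(s, n_symbols) for s in train_seqs])
    X_test = np.array([transition_features(s, n_symbols) for s in test_seqs])
    clf = SVC(kernel="rbf").fit(X_train, train_labels)
    return clf.predict(X_test)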

abstract:
Sequence classification is central to many practical problems within machine
learning. Distances metrics between arbitrary pairs of sequences can be hard
to define because sequences can vary in length and the information contained
in the order of sequence elements is lost when standard metrics such as
Euclidean distance are applied. We present a scheme that employs a Hidden
Markov Model variant to produce a set of fixed-length description vectors
from a set of sequences. We then define three inference algorithms, a
Baum-Welch variant, a Gibbs Sampling algorithm, and a variational algorithm,
to infer model parameters. Finally, we show experimentally that the fixed
length representation produced by these inference methods is useful for
classifying sequences of amino acids into structural classes

[pdf]

}}



@inproceedings{kerr-11_activity-recognition-w-FSM,
 title={Activity recognition with finite state machines},
  author={Wesley Kerr and Anh Tran and Paul Cohen},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1348--1353},
  year={2011},
  organization={AAAI Press},
  abstract = {

--- 2013 ~jdhakar/cs365/hw2/Paper%20review.pdf

This paper is concerned with a general approach to activity recognition
using Finite State Machines (FSMs). The training data are specified as
propositions that have truth values over intervals of time (a propositional
multivariate time series, PMTS). For classifying fluents that vary over time
intervals it is not necessary to use all the data: components of an activity
may repeat within an interval, so repeated values are reduced to a single
characteristic, producing a compressed bit array (CBA) from the given PMTS.
To learn generalisations of instances of activities, relationships between
the fluents used in those instances must be established, which is done with
Allen relations. The relationships are expressed as a qualitative sequence,
obtained by first normalising [Winarko and Roddick, 2007] and sorting by
earliest finishing time and then applying the Make-Sequence algorithm
described in the paper; the qualitative sequence can be shortened by using
an interaction-window size.

Signatures are formed from the qualitative sequences of an activity; a
signature is an ordered sequence of weighted Allen relations. Signatures are
updated using dynamic programming with the Needleman-Wunsch global sequence
alignment algorithm; only insertions or matches are allowed during
alignment, to prevent data loss, and the signature is pruned after a fixed
number of training instances based on the weight of each element, so that
its size does not keep growing. Less important fluents are then identified
from the weights of the relations involving each fluent: if none exceeds a
threshold, the fluent is considered unimportant and removed before FSMs are
formed from the CBAs. The FSMs for the individual instances are merged, the
final states of those instances become accepting states of the merged FSM,
and this FSM is used as the recogniser. The result is one trained FSM per
activity.
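
A toy sketch of how such a trained FSM acts as a recogniser (my own
illustration; the paper's FSMs are built from CBAs and Allen relations, and
the states and symbols here are invented):

def accepts(fsm, start, accepting, symbols):
    """Run a deterministic FSM (dict: state -> {symbol: next_state}) over a
    sequence of symbols and report whether it ends in an accepting state."""
    state = start
    for s in symbols:
        if s not in fsm.get(state, {}):
            return False                     # no transition defined: reject
        state = fsm[state][s]
    return state in accepting

# toy recogniser that accepts sequences of the form a, b, ..., b, c
fsm = {0: {"a": 1}, 1: {"b": 1, "c": 2}}
print(accepts(fsm, start=0, accepting={2}, symbols=["a", "b", "b", "c"]))  # True
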
“We count a true positive (tp) when FSM_i recognises an instance of activity
i, a false positive (fp) when FSM_i recognises an instance of a different
activity j, a true negative (tn) when FSM_i rejects an instance of activity
j, and a false negative (fn) when FSM_i rejects an instance of activity i.
Recall is defined as rec = tp/(tp+fn) and precision as prec = tp/(tp+fp). We
report the F1 measure: 2(prec x rec)/(prec+rec).” [Kerr et al., 2011]

The data used for evaluation come from three different sources: Wubble World
3D (ww3d) [Kerr et al., 2008], Wubble World 2D (ww2d), and individuals'
handwriting [Kerr, 2010]. Recognition of previously unseen test cases is
evaluated with the F1 score, and performance is observed to increase quickly
with the amount of training data. If one activity is a component of another
activity, the performance on the component activity is lower, because its
recogniser also accepts instances of the larger activity, producing false
positives. Research on temporal patterns in interval time series has
generally been based on extracting classifications or patterns above a
threshold frequency; previously, temporal patterns were described with Allen
relations and an Apriori-like algorithm [Agrawal and Srikant, 1994], but
this paper introduces a new way to represent them, as Finite State Machines
(FSMs). “Overall, the recognizers perform with F1 scores at or above 0.7
across all three datasets.” [Kerr et al., 2011]
References:
[Kerr et al., 2011] Wesley Kerr, Paul Cohen and Anh Tran, Activity Recognition 
with Finite State Machines,  Proc. IJCAI’11 International Joint Conference on 
Artificial Intelligence 22(2):1348-1353, 2011.
[Winarko and Roddick, 2007] Edi Winarko and John F Roddick. ARMADA – An 
Algorithm for Discovering Richer Relative Temporal Association Rules from 
Interval-Based Data. Data & Knowledge Engineering, 63:76–90, 2007.
[Kerr, 2010] Wesley N. Kerr. Learning to Recognize Agent Activities and 
Intentions. PhD thesis, University of Arizona, 2010.
[Agrawal and Srikant, 1994] Rakesh Agrawal and Ramakrishnan Srikant. Fast 
Algorithms for Mining Association Rules. In Proceedings of the 20th VLDB 
Conference, 1994.

---

This paper shows how to learn general, Finite State Machine representations
of activities that function as recognizers of previously unseen instances of
activities. The central problem is to tell which differences between
instances of activities are unimportant and may be safely ignored for the
purpose of learning generalized representations of activities. We develop a
novel way to find the "essential parts" of activities by a greedy kind of
multiple sequence alignment, and a method to transform the resulting
alignments into Finite State Machine that will accept novel instances of
activities with high accuracy.


[pdf]

}}




@inproceedings{albanese-subrahmanian-11ijcai_finding-unexplained-activity-in-video,
 title={Finding unexplained activities in video},
  author={Massimiliano Albanese and Cristian Molinaro and Fabio Persia and Antonio Picariello and V. S. Subrahmanian},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1628--1634},
  year={2011},
  abstract = {

---review 2013 sampath kumar ~sheshu/cs365/HW2/Finding%20Unexplained%20Activities%20in%20Video.pdf

Video surveillance is omnipresent. For instance, airport baggage areas are
continuously monitored for suspicious activities.  In crime-ridden
neighborhoods, police often monitor streets and parking lots using video
surveillance. In Israel, highways are monitored by a central authority for
suspicious activities. However, all these applications search for known
activities – activities that have been identified in advance as being either
“normal” or “abnormal”.

A stochastic activity is a labeled directed graph A = (V, E, δ, ρ) where

 *  V is a finite set of nodes labeled with action symbols from S

 *  E ⊆ V × V is a set of edges

 *  δ : E → N+ is a function that associates, with each edge e = (vi, vj), an
	upper bound on the time that can elapse between vi and vj

 *  ρ is a function that associates, with each node v ∈ V having out-degree 1
	or more, a probability distribution on v's outgoing edges
	{ (v, v') | (v, v') ∈ E }, i.e., Σ_{(v,v') ∈ E} ρ((v, v')) = 1

 *  { v ∈ V | there is no v' ∈ V s.t. (v', v) ∈ E } ≠ ∅, i.e., there
	exists at least one start node in the activity definition
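
A minimal sketch (a hypothetical toy activity, not from the paper) of this
structure as plain Python containers:

    # A = (V, E, delta, rho) for a made-up "pick up a bag" activity.
    V = {"enter", "pick_up", "leave"}                    # nodes labeled with action symbols
    E = {("enter", "pick_up"), ("pick_up", "leave")}     # directed edges
    delta = {("enter", "pick_up"): 10,                   # upper bound on elapsed time per edge
             ("pick_up", "leave"): 20}
    rho = {"enter":   {("enter", "pick_up"): 1.0},       # per-node distribution over outgoing
           "pick_up": {("pick_up", "leave"): 1.0}}       # edges; each distribution sums to 1

    # start nodes = nodes with no incoming edge; the definition requires at least one
    start_nodes = {v for v in V if not any(dst == v for (_, dst) in E)}
    assert start_nodes, "a stochastic activity must have at least one start node"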

Unexplained Activity Probability Model

We start by noting that the definition of probability of an activity occurring in a video can implicitly involve 
conflicts.

Possible world: a possible world is a set of activity occurrences which do not conflict with one another,
i.e., an action symbol in a frame cannot belong to two distinct activity occurrences in the same world.
Writing o !~ o' when two occurrences conflict, !~* denotes the transitive closure of !~; it is an
equivalence relation and determines a partition of O into equivalence classes O1, . . ., Om.

A Conflict-Based Partitioning (CBP) of v (w.r.t. l) is a sequence <(v1, l1), . . . , (vm, lm)> such that:
• v1 · . . . · vm = v;
• li is the restriction of l to vi, i.e., it is a labeling of vi s.t. ∀f ∈ vi, li(f) = l(f), for 1 ≤ i ≤ m; and
• O(vi, li) = Oi, for 1 ≤ i ≤ m.

Totally unexplained activities theorem: Let a UAP instance be given. For a pair (f, s) s.t. f ∈ v and
s ∈ l(f), let ε = Σ_{o ∈ O s.t. (f,s) ∈ o} p*(o) · Weight(o) / Σ_{oj ∈ C(o)} Weight(oj). If ε > 1 − τ, then
there does not exist a sequence S occurring in v s.t. (f, s) ∈ S and PT(S) ≥ τ.

Partially unexplained activities theorem: Let <(v1, l1), . . . , (vm, lm)> be a CBP and let
(vi, li), . . . , (vi+n, li+n) be the sub-videos containing a sequence S occurring in v. Let v* = vi · . . . · vi+n
and l* be a labeling for v* s.t., for every f ∈ v*, l*(f) = l(f). Then IP(S) computed w.r.t. v and l is equal
to IP(S) computed w.r.t. v* and l*.

Algorithms are developed for finding the unexplained activities using the above theorems.

---

Consider a video surveillance application that monitors some location. The
application knows a set of activity models (that are either normal or
abnormal or both), but in addition, the application wants to find video
segments that are unexplained by any of the known activity models — these
unexplained video segments may correspond to activities for which no previous
activity model existed. In this paper, we formally define what it means for a
given video segment to be unexplained (totally or partially) w.r.t. a given
set of activity models and a probability threshold. We develop two algorithms
– FindTUA and FindPUA – to identify Totally and Partially Unexplained
Activities respectively, and show that both algorithms use important pruning
methods. We report on experiments with a prototype implementation showing
that the algorithms both run efficiently and are accurate.

[pdf]

}}




@inproceedings{abolhassani-clark-11ijc_gaze-based-task-classification-w-HMMs,
 title={Visual task inference using hidden markov models},
  author={Amin Haji Abolhassani and James J. Clark},
  booktitle={Twenty-Second International Joint Conference on Artificial Intelligence},
  year={2011},
  abstract = {

---

Visual tasks, such as reading and counting, strongly shape eye movement
patterns. Yarbus showed that different eye movement trajectories emerge
depending on the visual task that the viewers are given. The main objective
is to develop an inverse Yarbus process: inferring the visual task when we
are given measurements of a viewer’s eye movements as the input. “The method
we are proposing is to use Hidden Markov Models (HMMs) to create a
probabilistic framework to infer the viewer’s task from eye movements.”

Suppose we have data of the form (Q, y), where y ∈ Y is a task label (e.g.,
reading, counting) and Q is the vector containing the observation sequence of
fixation locations (q1, q2, ..., qT), sampled from a stochastic (not
deterministic) process {qt} in discrete time t = {1, 2, ..., T} over random
image locations; hence a discrete state space S = [s1, s2, ..., sN], so that

		qi = sj | i ∈ [1,T], j ∈ [1,N]

We can then find the probability of task y for a given input sequence Q using
Bayes' rule.
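
Equation (1), not reproduced in this review, is presumably the Bayes-rule
posterior over tasks:

	P(y|Q) = P(Q|y) P(y) / P(Q),   so the inferred task is argmax_y P(Q|y) P(y),

with P(Q|y) being the likelihood term modeled below.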

To get the likelihood term, two models are proposed:
   *  Bottom-Up Models
   *  Top-Down Models

Bottom-Up Models:

This model is task-independent: it is based only on the input image, i.e.,
the observations qi are assumed to be independent. Equation (2) then reduces
to:

	P(Q|y) = P(Q) = P(q1,q2,...,qT) = P(q1)·P(q2)···P(qT)

This is the naïve Bayes assumption.

In this model, the location with the highest salience is taken as the current
Focus Of Attention (FOA), but this gives very poor results.

--Top Down Models--

The poor bottom-up results indicate that attention is not just
saliency-dependent; it depends mainly on the ongoing task together with the
saliency of the targets.

The top-down model is task dependent, so the likelihood term from equation
(2) is modified to:

	P(Q|y) = P(q1,q2,...,qT|y) = P(q1|y)P(q2|y)...P(qT|y)

This still makes the naïve Bayes assumption that the observations are
conditionally independent, so some problems remain with the top-down model.

--Hidden Markov Model--

To allow dependencies between the fixations q1, q2, ..., qT, a first-order,
discrete-time, discrete-state-space Markov chain is used to model attention.
Under this model, the likelihood term from equation (2) becomes:

	P(Q|y) = P(q1,q2,...,qT|y) = P(q1|y)P(q2|q1,y)...P(qT|qT−1,y)

This allows dependencies between consecutive fixations to be exploited when
making inferences about the task. Plugging this likelihood term into equation
(1), we can infer the task.

However, in real life human attention often deviates from the locus of
fixation, and factors such as equipment bias add further noise; attention
becomes covert, so the FOA is hard to track directly. Hidden Markov Models
(HMMs) are used to deal with this.

--HMMs: (To deal with Covert Attention)--

A typical discrete-time, continuous HMM λ can be defined by a set of parameters λ = (A,B,Π)
where A is the state transition probabilities, B is the Observation pdf and Π is initial state distribution.

HMMs mainly deal with 3 problems: evaluation, decoding and training. They can be solved using 
forward, Viterbi and Baum-Welch algorithms respectively. Evaluation gives probability of observing the 
sequence given the HMM, i.e.,P(O|λ) where O represents the sequence of observations.

In our model, covert attention is represented by the hidden states of a
task-dependent HMM λy (where λy is equivalent to y, i.e., both represent a
specific task). Fixation locations then correspond to the observations of the
HMM and are used to train the task-dependent models and to evaluate the
probability P(O|λy). Finally, O corresponds to Q in equation (1), so (1) is
modified accordingly.

So, after training the task-dependent HMM λy for each task, we can calculate
the likelihood term P(O|λy) by applying the parameters of λy to the forward
algorithm.

In this way, given an eye trajectory as input, we plug the likelihood term
into the final equation and output the task y with the maximum P(λy|O) as the
inferred task.
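
A minimal sketch of this inference step (shapes and interfaces assumed;
discrete emissions stand in for the paper's continuous observation
densities): score the fixation sequence under one trained HMM per task with
the forward algorithm and return the argmax.

    import numpy as np

    def log_forward(A, B, pi, obs):
        """log P(O | lambda) for a discrete HMM lambda = (A, B, pi).
        A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial
        distribution, obs: sequence of observation-symbol indices."""
        alpha = pi * B[:, obs[0]]
        log_p = 0.0
        for t in obs[1:]:
            alpha = (alpha @ A) * B[:, t]
            scale = alpha.sum()
            log_p += np.log(scale)
            alpha /= scale              # rescale to avoid numerical underflow
        return log_p + np.log(alpha.sum())

    def infer_task(hmms, obs):
        """hmms: dict task -> (A, B, pi); returns the task maximizing P(O | lambda_y)."""
        return max(hmms, key=lambda y: log_forward(*hmms[y], obs))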

---

It has been known for a long time that visual task, such as reading, counting
and searching, greatly influences eye movement patterns. Perhaps the best
known demonstration of this is the celebrated study of Yarbus showing that
different eye movement trajectories emerge depending on the visual task that
the viewers are given. The objective of this paper is to develop an inverse
Yarbus process whereby we can infer the visual task by observing the
measurements of a viewer’s eye movements while executing the visual task. The
method we are proposing is to use Hidden Markov Models (HMMs) to create a
probabilistic framework to infer the viewer’s task from eye movements.


[pdf]

}}



@article{taylor-hinton-roweis-11_distributed-models-high-dim-time-series,
title={Two Distributed-State Models For Generating High-Dimensional Time Series},
author={Taylor, G.W. and Hinton, G.E. and Roweis, S.T.},
journal={Journal of Machine Learning Research},
volume={12},
pages={1025--1068},
year={2011}
annote = {

[models activity as low-dimensional models; can actually generate animations]

ABSTRACT:
In this paper we develop a class of nonlinear generative models for
high-dimensional time series.  We first propose a model based on the
restricted Boltzmann machine (RBM) that uses an undirected model with binary
latent variables and real-valued "visible" variables. The latent and visible
variables at each time step receive directed connections from the visible
variables at the last few time-steps. This "conditional" RBM (CRBM) makes
on-line inference efficient and allows us to use a simple approximate learning
procedure. We demonstrate the power of our approach by synthesizing various
sequences from a model trained on motion capture data and by performing
on-line filling in of data lost during capture.  We extend the CRBM in a way
that preserves its most important computational properties and introduces
multiplicative three-way interactions that allow the effective interaction
weight between two variables to be modulated by the dynamic state of a third
variable. We introduce a factoring of the implied three-way weight tensor to
permit a more compact parameterization. The resulting model can capture
diverse styles of motion with a single set of parameters, and the three-way
interactions greatly improve its ability to blend motion styles or to
transition smoothly among them.  Videos and source code can be found at
http://www.cs.nyu.edu/~gwtaylor/publications/jmlr2011

[pdf]

}}




@inproceedings{leQ-zou-ngA-11_learning-hierarchical-temporal-action-recognition,
 title={Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis},
  author={Le, Quoc V. and Zou, W.Y. and Yeung, S.Y. and Ng, Andrew Y.},
  booktitle={Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on},
  pages={3361--3368},
  year={2011},
  annote = {

Unlike approaches that hand-craft the features used for action recognition,
here the features emerge as invariants under independent subspace analysis.

Results improve on the state of the art by 5% on the difficult Hollywood2 and
YouTube action recognition datasets. Code and learned features are available
for download.


[pdf]

}}




@article{luiYM-12_matrix-manifolds-for-computer-vision,
 title={Advances in matrix manifolds for computer vision},
  author={Lui, Y.M.},
  journal={Image and Vision Computing},
  volume={30},
  number={6},
  pages={380--388},
  year={2012},
  publisher={Elsevier}

applications to activity recognition: motions can be grouped using manifolds,
etc.

ABSTRACT

The attention paid to matrix manifolds has grown considerably in the computer
vision community in recent years. There are a wide range of important
applications including face recognition, action recognition, clustering,
visual tracking, and motion grouping and segmentation. The increased
popularity of matrix manifolds is due partly to the need to characterize
image features in non-Euclidean spaces. Matrix manifolds provide rigorous
formulations allowing patterns to be naturally expressed and classified in a
particular parameter space. This paper gives an overview of common matrix
manifolds employed in computer vision and presents a summary of related
applications. Researchers in computer vision should find this survey
beneficial due to the overview of matrix manifolds, the discussion as well as
the collective references.


[pdf]

}}



@InProceedings{jawanpuria-nath-12icml_convex-feature-learning-for-latent-task-structure,
  author =    {Pratik Jawanpuria and J. Saketha Nath},
  title =     {A Convex Feature Learning Formulation for Latent Task Structure Discovery},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {137--144},

Abstract: This paper considers the multi-task learning problem and in the
setting where some relevant features could be shared across few related
tasks. Most of the existing methods assume the extent to which the given
tasks are related or share a common feature space to be known apriori. In
real-world applications however, it is desirable to automatically discover
the groups of related tasks that share a feature space. In this paper we aim
at searching the exponentially large space of all possible groups of tasks
that may share a feature space. The main contribution is a convex formulation
that employs a graph-based regularizer and simultaneously discovers few
groups of related tasks, having close-by task parameters, as well as the
feature space shared within each group. The regularizer encodes an important
structure among the groups of tasks leading to an efficient algorithm for
solving it: if there is no feature space under which a group of tasks has
close-by task parameters, then there does not exist such a feature space for
any of its supersets. An efficient active set algorithm that exploits this
simplification and performs a clever search in the exponentially large space
is presented. The algorithm is guaranteed to solve the proposed formulation
(within some precision) in a time polynomial in the number of groups of
related tasks discovered. Empirical results on benchmark datasets show that
the proposed formulation achieves good generalization and outperforms
state-of-the-art multi-task learning algorithms in some cases. 

video: http://icml.cc/discuss/2012/90.html

pratik.j, saketh@cse.iitb.ac.in

[pdf]

}}

SUPERVISED LEARNING




@InProceedings{takeda-mitsugi-kanamori-12_unified-robust-classification,
  author =    {Akiko Takeda and Hiroyuki  Mitsugi and Takafumi  Kanamori},
 title =     {A Unified Robust Classification Model},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {129--136},
  annote = {

good review of supervised classification; combines the main algorithms into a unified model

Abstract: A wide variety of machine learning algorithms such as support
vector machine (SVM), minimax probability machine (MPM), and Fisher
discriminant analysis (FDA), exist for binary classification. The purpose of
this paper is to provide a unified classification model that includes the
above models through a robust optimization approach. This unified model has
several benefits. One is that the extensions and improvements intended for
SVM become applicable to MPM and FDA, and vice versa. Another benefit is to
provide theoretical results to above learning methods at once by dealing with
the unified model. We give a statistical interpretation of the unified
classification model and propose a non-convex optimization algorithm that can
be applied to non-convex variants of existing learning methods. 

discussion+video : http://icml.cc/discuss/2012/87.html


[pdf]

}}



@inproceedings{maX-luoP-11ijc_combining-supervised-unsupervised-via-embedding,
 title={Combining supervised and unsupervised models via unconstrained probabilistic embedding},
  author={Xudong Ma and Ping Luo and Fuzhen Zhuang and Qing He and Zhongzhi Shi and Zhiyong Shen},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1396--1401},
  year={2011},
  abstract = {

Unsupervised learning is used to improve the learning based on
several (conflicting) supervised learners.  Given a data set, an ensemble
categorization system traditionally consists of
several supervised learners assign some class IDs.  In the case of conflicts
one may use voting etc.  In this work, they try to use
several
unsupervised clustering algorithms, each of which create somewhat different
clusters.  The idea in this ensemble of learners is that the items in the
same unsup clusters should belong to the same final classes- as far as
possible.  this is used to tune the result of the supervised learning.

abstract:
Ensemble learning with output from multiple supervised and unsupervised
models aims to improve the classification accuracy of a supervised model
ensemble by jointly considering the grouping results from unsupervised
models. In this paper we cast this ensemble task as an unconstrained
probabilistic embedding problem. Specifically, we assume both objects and
classes/clusters have latent coordinates without constraints in a
D-dimensional Euclidean space, and consider the mapping from the embedded
space into the space of results from supervised and unsupervised models as a
probabilistic generative process. The prediction of an object is then
determined by the distances between the object and the classes in the embedded
space. A solution of this embedding can be obtained using the quasi-Newton
method, resulting in the objects and classes/clusters with high co-occurrence
weights being embedded close. We demonstrate the benefits of this
unconstrained embedding method by three real applications.

[pdf]

}}



@article{agarwal-daume-11ijc_geometric-view-of-conjugate-priors,
 title={A geometric view of conjugate priors},
  author={Arvind Agarwal and Hal Daum{\'e} III},
  journal={Machine learning},
  volume={81},
  number={1},
  pages={99--113},
  year={2010},
  publisher={Springer}
  abstract = {

In Bayesian machine learning, conjugate priors are popular, mostly due to
mathematical convenience. In this paper, we show that there are deeper
reasons for choosing a conjugate prior. Specifically, we formulate the
conjugate prior in the form of Bregman divergence and show that it is the
inherent geometry of conjugate priors that makes them appropriate and
intuitive. This geometric interpretation allows one to view the
hyperparameters of conjugate priors as the effective sample points, thus
providing additional intuition. We use this geometric understanding of
conjugate priors to derive the hyperparameters and expression of the prior
used to couple the generative and discriminative components of a hybrid model
for semi-supervised learning.


[pdf]

}}



@inproceedings{xiaoY-liuB-11ijc_similarity-based-positive-and-unlabeled-learning,
 title={Similarity-based approach for positive and unlabelled learning},
  author={Yanshan Xiao and Bo Liu and Jie Yin and Longbing Cao and Chengqi Zhang and Zhifeng Hao},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1577--1582},
  year={2011},
  abstract = {

Positive and unlabelled learning (PU learning) has been investigated to deal
with the situation where only the positive examples and the unlabelled
examples are available. Most of the previous works focus on identifying some
negative examples from the unlabelled data, so that the supervised learning
methods can be applied to build a classifier. However, for the remaining
unlabelled data, which can not be explicitly identified as positive or
negative (we call them ambiguous examples), they either exclude them from the
training phase or simply enforce them to either class. Consequently, their
performance may be constrained. This paper proposes a novel approach, called
similarity-based PU learning (SPUL) method, by associating the ambiguous
examples with two similarity weights, which indicate the similarity of an
ambiguous example towards the positive class and the negative class,
respectively. The local similarity-based and global similarity-based
mechanisms are proposed to generate the similarity weights. The ambiguous
examples and their similarity-weights are thereafter incorporated into an
SVM-based learning phase to build a more accurate classifier. Extensive
experiments on real-world datasets have shown that SPUL outperforms
state-of-the-art PU learning methods.


[pdf]

}}



@inproceedings{liYF-HuJH-12aaai_what-patterns-trigger-what-labels,
  title={Towards Discovering What Patterns Trigger What Labels},
  author={Yu-Feng Li and Ju-Hua Hu and Yuan Jiang and Zhi-Hua Zhou},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012},
  pages = {1012-1018},
  annote = {

multiple labels are associated with each object, with many overlaps.  
perhaps a label relates to some subset of features in each object.  
how does one create models of categories from this? 

formulate the problem as a convex optimization problem. 

abstract: 
In many real applications, especially those involving data objects with
complicated semantics, it is generally desirable to discover the relation
between patterns in the input space and labels corresponding to different
semantics in the output space. This task becomes feasible with MIML
(Multi-Instance Multi-Label learning), a recently developed learning
framework, where each data object is represented by multiple instances and is
allowed to be associated with multiple labels simultaneously. In this paper,
we propose KISAR, an MIML algorithm that is able to discover what instances
trigger what labels. By considering the fact that highly relevant labels
usually share some patterns, we develop a convex optimization formulation and
provide an alternating optimization solution. Experiments show that KISAR is
able to discover reasonable relations between input patterns and output
labels, and achieves performances that are highly competitive with many
state-of-the-art MIML algorithms.

[pdf]

}}


LANGUAGE / NLP




@article{collobert-weston-11_nlp-from-scratch,
 title={Natural language processing (almost) from scratch},
  author={Collobert, R. and Weston, J. and Bottou, L. and Karlen, M. and Kavukcuoglu, K. and Kuksa, P.},
  journal={The Journal of Machine Learning Research},
  volume={12},
  pages={2493--2537},
  year={2011},
  annote = {

abstract
We propose a unified neural network architecture and learning algorithm that
can be applied to various natural language processing tasks including
part-of-speech tagging, chunking, named entity recognition, and semantic role
labeling. This versatility is achieved by trying to avoid task-specific
engineering and therefore disregarding a lot of prior knowledge. Instead of
exploiting man-made input features carefully optimized for each task, our
system learns internal representations on the basis of vast amounts of mostly
unlabeled training data. This work is then used as a basis for building a
freely available tagging system with good performance and minimal
computational requirements.

--------------------------------------------------------------------------------
Task      Benchmark                Data set   Training set          Test set              #tags
POS       Toutanova et al. (2003)  WSJ        sections 0–18         sections 22–24        45
                                              (912,344 tokens)      (129,654 tokens)
Chunking  CoNLL 2000               WSJ        sections 15–18        section 20            42 (IOBES)
                                              (211,727 tokens)      (47,377 tokens)
NER       CoNLL 2003               Reuters    “eng.train”           “eng.testb”           17 (IOBES)
                                              (203,621 tokens)      (46,435 tokens)
SRL       CoNLL 2005               WSJ        sections 2–21         section 23            186 (IOBES)
                                              (950,028 tokens)      + 3 Brown sections
                                                                    (63,843 tokens)
--------------------------------------------------------------------------------


[pdf]

}}



@article{chen-mooney-11_interpret-NLP-navigation-instructions-from-observations,
 title={Learning to interpret natural language navigation instructions from observations},
  author={Chen, David L. and Mooney, Ray J.},
  journal={Association for the Advancement of Artificial Intelligence (AAAI), Cambridge, MA},
  year={2011},
  annote = {

get robots to follow instructions such as
    Go towards the coat hanger and turn left at it.  Go straight down the
    hallway and the dead end is position 4.

videos plus human instructions and instruction-following navigation

dataset collected by Ben Kuipers and Matt MacMahon

6 instructors
3 maps
1-15 followers for each direction
706 total instructions
10 actions

?project?
1. get GIVE corpus Giving Instructions in Virtual Environments
2. work on mapping nouns?
http://www.give-challenge.org/research/page.php?id=give-2-corpus

---

2013:  A.U.S.S Pradeep(10001), ~deepu/cs365/hw2/Review.pdf

The main objective is to build a system that can interpret and follow
free-form natural-language instructions to move to a desired location, based
on the way humans follow sample instructions, with no prior linguistic
knowledge: syntactic, semantic, or lexical.

Introduction:
Formally, the system is given training data of the form {(e1,a1,w1),
(e2,a2,w2), ..., (en,an,wn)}, where ei is a natural language instruction,
ai is an observed action sequence, and wi is a description of the current
state of the world, including the patterns of the floors and walls and the
positions of any objects. The goal is then to build a system that can produce
the correct aj given a previously unseen (ej, wj) pair. For training and
testing, data and environments from three different virtual worlds consisting
of interconnecting hallways are used.

Contributions:

Given a training triple (ei, ai, wi), a plan pi is constructed from the
action sequence ai and the observed state of the world wi. The pair (ei, pi)
is then used as training data for a semantic parser. During testing, the
parser maps new instructions ej to formal navigation plans pj, which are
carried out by the execution module (MARCO).

MARCO is an agent that follows free-form, natural language route instructions
by representing and executing a sequence of compound action specifications
that model which actions to take under which conditions. The paper contrasts
a ‘basic plan’ approach with a ‘landmarks plan’ approach.

To remove extraneous information, a ‘plan refinement’ step first learns the
meanings of short phrases and words and then uses the learned lexicon to
remove the parts of plans unrelated to the instructions. The lexicon-learning
algorithm collects all plan graphs g that occur along with a phrase w and
takes intersections of all possible pairs of meanings, then ranks the entries
by a scoring function. Intuitively, the score measures how much more likely a
graph g is to occur when w is present than when it is not present.
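
A minimal sketch (hypothetical data layout; one plausible reading of the
scoring function, not the paper's exact definition) of this lexicon scoring:

    def lexicon_score(g, w, examples):
        """examples: list of (phrases, plan_graphs) pairs from the training data;
        g is a candidate meaning graph, w a phrase.  Scores how much more likely
        g is to occur when w is present than when it is absent."""
        with_w = [graphs for phrases, graphs in examples if w in phrases]
        without_w = [graphs for phrases, graphs in examples if w not in phrases]
        p_g_given_w = sum(g in graphs for graphs in with_w) / max(len(with_w), 1)
        p_g_given_not_w = sum(g in graphs for graphs in without_w) / max(len(without_w), 1)
        return p_g_given_w - p_g_given_not_w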

For learning the semantic parser, KRISP (Kernel-based Robust Interpretation
for Semantic Parsing) (Kate and Mooney, 2006) is used: a supervised learning
system for semantic parsing that takes NL sentences paired with their meaning
representations (MRs) as training data. KRISP trains the classifiers used in
semantic parsing iteratively.

Experimental Results:

The system's ability to infer correct navigation plans is tested using
partial parse accuracy as the metric. Compared to basic plans, landmark plans
have better recall but considerably lower precision; the lexicon-refined
plans retain both high precision and high recall.

End-to-end execution is tested by how well the system performs the overall
navigation task on a strict metric of successful finishes. The better
performance of the basic plans on the single-sentence task shows that for
these shorter instructions, directly modeling the low-level actions is often
sufficient. Landmarks are useful for recovering from small parsing mistakes,
so the system using refined landmark plans performed best among the first
three models on the complete-instructions test case. MARCO was fully manually
engineered for this environment and hand-tuned on this data to achieve the
best performance.

Conclusion:

This paper presents a novel system that learns a semantic parser for
interpreting navigation instructions simply by observing the actions of human
followers, without using any prior linguistic knowledge or direct
supervision. One shortcoming is that a mistake committed in an earlier stage
propagates to later stages; a built-in feedback loop working iteratively
could improve performance further.

References:

[1]"Semi-Supervised Learning for Semantic Parsingusing Support 
Vector Machines”- Rohit J. Kate and Raymond J. Mooney.
[2]Matt MacMahon, Brian Stankiewicz &Benjamin Kuipers
'Walk the talk: Connecting language, knowledge, and action in route instructions.'' National Conference on 
Artificial Intelligence (AAAI-06).
[3]‘Learning to Interpret Natural Language Navigation Instructions from Observations’ by David L.Chen,
http://www.cs.utexas.edu/users/ml/clamp/navigation/

---

Abstract:

The ability to understand natural-language instructions is critical to
building intelligent agents that interact with humans. We present a system
that learns to transform natural-language navigation instructions into
executable formal plans. Given no prior linguistic knowledge, the system
learns by simply observing how humans follow navigation instructions. The
system is evaluated in three complex virtual indoor environments with
numerous objects and landmarks. A previously collected realistic corpus of
complex English navigation instructions for these environments is used for
training and testing data. By using a learned lexicon to refine inferred
plans and a supervised learner to induce a semantic parser, the system is
able to automatically learn to correctly interpret a reasonable fraction of
the complex instructions in this corpus. We will also present recent results
on using syntax (part-of-speech tags) to improve semantic-lexicon learning
and using an unsupervised PCFG approach to learn from the highly ambiguous
supervision provided in the navigation task.


[pdf]

}}

MACHINE LEARNING in NLP




@inproceedings{naradowsky-goldwater-09_improving-morphology-induction-via-spelling,
 title={Improving morphology induction by learning spelling rules},
  author={Naradowsky, J. and Goldwater, S.},
  booktitle={IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence},
  pages={11--17},
  year={2009}

morphological particles like "-s" may be added after (or before) words to
serve grammatical functions like pluralization.  Detecting such structures
via machine learning is an important approach for language processing.

abstract:

Unsupervised learning of morphology is an important task for human learners
and in natural language processing systems. Previous systems focus on
segmenting words into substrings (taking⇒ tak.ing), but sometimes a
segmentation-only analysis is insufficient (e.g., taking may be more
appropriately analyzed as take+ing, with a spelling rule accounting for the
deletion of the stem-final e). In this paper, we develop a Bayesian model for
simultaneously inducing both morphology and spelling rules. We show that the
addition of spelling rules improves performance over the baseline
morphology-only model.

??project - apply to hindi?


[pdf]

}}



@article{bod-smets-12_empiricist-approach-to-nativism-via-unsupervised-TSG,
 title={Empiricist solutions to nativist puzzles by means of unsupervised TSG},
  author={Bod, R. and Smets, M.},
  journal={EACL 2012},
  pages={10},
  year={2012}
  annote = {

Uses the tree-substitution parsing strategy to parse a body of sentences
directed at a child (the Adam sub-corpus of the CHILDES corpus), and argues
that a child may learn grammar not because it is born with that ability
(nativism) but because it is part of broader general abilities.

TSG = tree-substitution grammar [as opposed to linear parse tree]

? project ?  : TSG / DOP software is downloadable

---review 2013 Lalit Kumar ~lalitkr/cs365/hw2/Assignment2.pdf

Abstract :
This paper addresses problems related to wh-questions, relative clause formation, topicalization, extraposition and left 
dislocation.

These hard cases can be learned by a simple unsupervised grammar induction algorithm that returns the sentence with the 
best-ranked derivation for a particular phenomenon, using only a very small fraction of the input a child receives.

The nativist view - there is an innate language-specific component; human language acquisition is guided by innate rules and
constraints.

The empiricist view - there is no language-specific component; language acquisition is the product of abstractions from 
empirical input by means of general cognitive capabilities.

The Argument from the Poverty of the Stimulus (Pullum & Scholz 2002 for a detailed discussion):

(i) Children acquire a certain linguistic phenomenon
(ii) The linguistic input does not give enough evidence for acquiring the
	phenomenon 
(iii) There has to be an innate component for the phenomenon

This paper falsifies step (ii): even if a linguistic phenomenon is not in a
child’s input, it can be learned by an “ideal” learner from a tiny fraction
of child-directed utterances, namely by combining fragments from these
utterances, using the Adam corpus in the Childes database (MacWhinney 2000).

Methodology: Induced Tree-Substitution Grammar (TSG)

[Figure: two TSG derivations, resulting in different parse trees, for the
sentence "She saw the dress with the telescope"]

Grammar induction algorithm (Zollmann & Sima'an 2005). Given a corpus of
sentences:

*  Divide the corpus into a 50% Extraction Corpus (EC) and a 50% Held out
	Corpus (HC). 
*  Assign all unlabelled binary trees to the  sentences in EC and store them
	in a parse forest. 
*  Convert the subtrees from the parse forests into  a compact PCFG reduction
	(Goodman 2003). 
*  Compute the k-shortest derivations for the sentences in HC using the PCFG
	reduction. 
* Compute the best-ranked derivation for each sentence by the sum of the
	ranks of the subtrees (the most frequent subtrees get rank 1, the
	next most frequent get rank 2, etc.; the best-ranked derivation is
	thus the one with the lowest total score; see the sketch after this
	list).
* Use the subtrees in the trees generated by the best-ranked derivations to
   form the TSG
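
A minimal sketch of the best-ranked-derivation selection (hypothetical data
layout; subtree ranks are assumed to be precomputed from frequencies):

    def best_ranked(derivations, subtree_rank):
        """derivations: list of candidate derivations, each a list of subtrees;
        subtree_rank: dict subtree -> frequency rank (1 = most frequent).
        Returns the derivation with the lowest summed rank."""
        return min(derivations, key=lambda d: sum(subtree_rank[t] for t in d))

    # e.g. among the wh-question derivations listed later in this review,
    # [who X] o [kissed Bella] wins with total rank 22 + 6,694 = 6,716,
    # beating the ungrammatical alternatives.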

Convention

When the induced TSG parses a sentence, we obtain a derivation consisting of
subtrees (the operation 'o' stands for leftmost-node substitution of
TSG subtrees). Unlabelled subtrees are written in square brackets, and
POS tags are substituted with the words.

Some Constraints and terms: 

Islands: Certain syntactic structures are islands; i.e, they are effectively
  isolated from the rest of the sentence they are in, for the purpose of
  various kinds of rules, typically those which involve licensing the
  appearance of sentence elements in unexpected places in the structure,
  while suppressing them in their "original" (i.e, expected)
  position. (J. R. Ross (1967), Constraints on Variables in Syntax)

Complex NP constraint: A Complex NP is one with a noun head and a modifying
  clause (J. R. Ross (1967), Constraints on Variables in Syntax)

*  Relative Clauses, which always contain a noun coreferential to the head noun:
	 [ the dog_i[(which_i) Mary saw ___i] ]
	   NP S S NP

*  NP Complements, which are only possible where the head noun is a picture noun, i.e, a noun denoting a symbolic 
representation like picture, story, rumor, etc; the complement denotes the content:
* [ the report_i [that Quayle sleeps with a Teddy Bear] ]
	NP S_i S_i NP

Wh-question: We use question words to ask certain types of questions (question word questions). We often refer to them as 
WH words because they include the letters WH .
For example WHy, HoW

Relative Clause Formation: We use relative clauses to give additional information about something without starting another 
sentence. By combining sentences with a relative clause, your text becomes more fluent and you can avoid repeating certain 
words.

Example: A girl is talking to Tom. Do you know the girl?
	 Do you know the girl who is talking to Tom?

Topicalisation: Topicalization is a mechanism of syntax that establishes an expression as the sentence or clause topic by 
having it appear at the front of the sentence or clause (as opposed to in a canonical position further to the right).

Example: a. I won't eat that pizza.
	 b. That pizza, I won't eat. 

Topicalization of the object argument _that pizza_

Extraposition: Extraposition is a mechanism of syntax that alters word order in such a manner that a relatively 
"heavy" constituent appears to the right of its canonical position

Example: a. Someone who we don't know left a message.
b. Someone left a message who we don't know. - Extraposition of relative clause out of subject

Left Dislocation: Here the constituent (which could otherwise be either an
argument or an adjunct of the clause) is advanced. It can be used to
emphasize or define a topic.

For example: The sentence This little girl, the dog bit her has the same meaning as The dog bit this little girl but it emphasizes 
that the little girl (and not the dog) is the topic of interest;


--Problems addressed by Paper | explained from examples--

The problem of wh-questions : usually dealt with by a specific set of “island
constraints”, where islands are constituents out of which wh-elements cannot
move.

Subject wh-questions-Example: To explain how children know that (a) is the
grammatical way of asking the particular question, and (b),(c) and (d) are
not

(a) who kissed Bella - [who X] o [kissed Bella] ranking: 22 + 6,694 = 6,716
(b) *kissed who Bella-[X Bella] o [kissed who] ranking: 24 + 6,978 = 7,002
(c) *did who kiss Bella-[did X Bella] o [who kiss] ranking: 4,230 + 8,527 = 12,757
(d) *who did kiss Bella.-[X kiss Bella] o [who did] ranking: 4,636 + 2,563 = 7,199

The problem of Relative clause formation – closely related to wh-question

Example:

a.) the vampire who I read a book about is dangerous -- the 'moved' phrase
'the vampire' is taken out of the non-complex NP 'a book about ___'

b.) *the vampire who I read a book which was about is dangerous -- 'the
vampire' is 'moved' out of the complex NP 'a book which was about ___', which is not allowed.

(a.) [the vampire X is dangerous] o [who I read X] o [a book about] ranking:
1,714,906

(b.) [the vampire X is dangerous] o [who I read X] o [a book which X] o [was
about] ranking: 1,906,597

The problem of Extraposition from NP

Example:

(a.) that Jacob picked Bella up who loves Edward is possible
(b.) *that Jacob picked Bella up is possible who loves Edward
(a') [X is possible] o [that Jacob picked X] o [Bella up X] o [who loves Edward] ranking: 1,080,535
(b') [X is possible X] o [that Jacob picked X] o [Bella up] o [who loves Edward] ranking: 1,111,155

The problem of Topicalization: Example, the topicalization in (a) is fine but in (b) it is not.

(a) Stephenie's book I read - [X I read] o [Stephenie's book] ranking: 608 + 2,784 = 3,392

(b) *Stephenie's I read book - [Stephenie's X book] o [I read] ranking: 3,139 +
488 = 3,627

The problem of Left dislocation: The phenomenon of Left dislocation provides a particular challenge to nativist approaches 
since it shows that there are grammatical sentences that do not obey the Coordinate Structure Constraint (Adger 2003; 
Borsley 2004).

In Left dislocation the moved constituent must be moved to the left of the main clause. Instead, movement merely to the left 
of a subordinate clause results in an ungrammatical sentence.(Ross 1967)

Example:

(a) Edward, that you love him is obvious -- grammatical, `Edward' is moved to the left of the main clause.

(b) *that Edward, you love him is obvious--ungrammatical, `Edward' is only moved to the left of the subordinate
clause `that you love '.

TSG has no problem in distinguishing between these two alternatives, as is shown below

(a') [Edward X is obvious] o [that you love him] ranking: 590,659 + 57,785 = 648,444
(b') [that X is obvious] o [Edward you love him] ranking: 876,625 + 415,940 = 1,292,565

Conclusion: 

Unsupervised TSG can capture virtually all phenomena related to wh-questions
in a simple and uniform way. The question of where the preferences ultimately
come from should be answered within the field of language evolution. The
model only ranks alternative sentences (for a certain phenomenon); ideally,
rather than selecting from given sentences, we would like a model that starts
from a meaning representation for which the best sentence is then generated.
In the absence of such a semantic component, the model selects directly from
the set of possible sentences as they are provided in the literature as
alternatives.

Furthermore, the model can be extended to cover other phenomena that fall
outside the scope of the traditional nativist account. It is thus an
alternative approach that can model all the hard phenomena on the basis of
just one principle (the best-ranked derivation); for at least these
phenomena, Arguments from the Poverty of the Stimulus can no longer be
invoked. Hence the Argument from the Poverty of the Stimulus is refuted by
this model.

Acknowledgement:
http://www-personal.umich.edu/~jlawler/aue/ross.html
http://www.ego4u.com/en/cram-up/grammar/relative-clauses
http://en.wikipedia.org/wiki/Topicalization
http://en.wikipedia.org/wiki/Extraposition
http://en.wikipedia.org/wiki/Dislocation_(syntax)


---

abstract: 
While the debate between nativism and empiricism exists since several
decades, surprisingly few common learning problems have been proposed for
assessing the two opposing views. Most empiricist researchers have focused on
a relatively small number of linguistic problems, such as Auxiliary Fronting
or Anaphoric One. In the current paper we extend the number of common test
cases to a much larger series of problems related to wh-questions, relative
clause formation, topicalization, extraposition from NP and left
dislocation. We show that these hard cases can be empirically solved by an
unsupervised tree-substitution grammar inferred from child-directed input in
the Adam corpus (Childes database).

[pdf]

}}



@InProceedings{matuszek-fitzgerald-zettlemoyer-12icml_joint-language-and-perception-learning,
  author =    {Cynthia Matuszek and Nicholas FitzGerald and Luke Zettlemoyer and Liefeng Bo and Dieter Fox},
 title =     {A Joint Model of Language and Perception for Grounded Attribute Learning},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {1671--1678},

discovers language and vision labels for colours and object shapes. 
?project?

--- 2013 Vedant mishra /~vedant/cs365/hw2/report.pdf

This paper presents an approach for jointly learning language and perception
models for grounded attribute induction. The perception model learns
different physical attributes such as colour and shape, while the language
model learns representations of the meanings of natural language sentences;
the joint model combines the two. It first relates the logical meanings of
sentences to the attribute classifiers, and then uses the classifiers to
determine which objects are being referred to.

For example, given the sentence "Pick all the yellow blocks", the robot
analyses the sentence with the semantic parsing model, represents the
meanings of the individual words, and then classifies the appropriate blocks
(in this case, all yellow blocks). The required set of objects is thus the
set of all yellow blocks, as shown on the right side of Figure 1.

[Figure 1: overview of the approach (image taken from ref [1])]

--Overview of Methodology--

The language and the perception models used in the learning process are:

*  A semantic parser (language model): the FUBL algorithm is used for
   semantic parsing; it assigns logical meanings (typically higher-order
   forms such as lambda-calculus expressions) to sentences. FUBL learns a
   factored Combinatory Categorial Grammar and a log-linear model that yields
   the probability of a parse. For example, given the sentence "Pick all the
   yellow blocks", the algorithm parses it and assigns logical meanings to
   words like "pick" and "yellow".

*  Visual attribute classifiers (perception model): a set of classifiers,
   each of which returns true for the objects in the scene that have the
   corresponding attribute. For example, for "Pick all the yellow blocks",
   once the logical meaning of "yellow" is known, the corresponding
   classifier returns true for all yellow blocks among the given objects.

*  The joint model is the combination of the above two models. Given a
   natural language sentence x and a set of scene objects O, it identifies
   the subset G of objects described by x. Summing over the possible
   classifier outputs w and the possible logical forms z of the sentence:

	P(G|x,O) = sum_z sum_w P(G,z,w|x,O)

	P(G,z,w|x,O) = P(z|x) * P(w|O) * P(G|z,w)

   Here P(z|x) is the language model, P(w|O) the vision model, and P(G|z,w) a
   conditional compatibility term. The predicted set is arg max_G P(G|x,O),
   the most probable set of objects given the sentence and the scene.

Model learning is done in two phases:

*  Initialization phase: a small, supervised data set is used to construct
   the initial language and perceptual models; this covers basic words and
   objects.

*  Assigning unseen words to classifiers: a new classifier is created for
   each unseen word (e.g., for a new word such as the colour "blue"), and a
   new lexeme is created by pairing the unseen word with a logical constant.
   Adding a new classifier requires re-estimating the language and perception
   parameters, which is done with the expectation maximization (EM)
   algorithm, maximizing the marginal probabilities.
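
A minimal sketch (hypothetical interfaces; a brute-force enumeration of
candidate sets rather than the paper's inference procedure) of combining the
two models according to the factorization above:

    from itertools import chain, combinations

    def predict_set(objects, parse_probs, classifier_probs, p_g_given_zw):
        """parse_probs: dict z -> P(z|x); classifier_probs: dict w -> P(w|O);
        p_g_given_zw(G, z, w): compatibility term P(G|z,w).
        Returns argmax_G sum_z sum_w P(z|x) P(w|O) P(G|z,w)."""
        def all_subsets(items):
            items = list(items)
            return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
        def score(G):
            return sum(pz * pw * p_g_given_zw(G, z, w)
                       for z, pz in parse_probs.items()
                       for w, pw in classifier_probs.items())
        return max((frozenset(G) for G in all_subsets(objects)), key=score)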

--Experiments and Results (taken from ref [1])--

Two features are extracted for every object in the given scene using kernel
descriptors: a depth channel that corresponds to shape and an RGB channel for
colour. Objects are thus classified by their shape and colour, so a red
triangular block falls in a different group than a blue square block.

To measure task performance, the models are trained on 6 objects and tested
on the other 6 objects. The system performs well, with an average precision
of 82%.

To assess the need for the joint model, the performance of the vision,
language and joint models are compared; the results are shown in Table 1
([1]). The precision, recall and F1 score of the joint model are greater than
those of either the language or the vision (perception) model alone.

Table 1: Results (taken from [1])

	Model		Precision	Recall	F1 Score
	Vision		0.92		0.41	0.55
	Language	0.52		0.09	0.14
	Joint		0.82		0.71	0.76

---


Abstract: As robots become more ubiquitous and capable, it becomes ever more
important to enable untrained users to easily interact with them. Recently,
this has led to study of the language grounding problem, where the goal is to
extract representations of the meanings of natural language tied to
perception and actuation in the physical world. In this paper, we present an
approach for joint learning of language and perception models for grounded
attribute induction. Our perception model includes attribute classifiers, for
example to detect object color and shape, and the language model is based on
a probabilistic categorial grammar that enables the construction of rich,
compositional meaning representations. The approach is evaluated on the task
of interpreting sentences that describe sets of objects in a physical
workspace. We demonstrate accurate task performance and effective
latent-variable concept induction in physical grounded scenes.

discussion+video: http://icml.cc/discuss/2012/822.html

[pdf]

}}



@inproceedings{deCruys-korhonen-12coling_tensor-factorization-for-unsup-lexical-acquisition,
 title =	 {Multi-way Tensor Factorization for Unsupervised Lexical
                  Acquisition},
  authors =	 {Tim Van de Cruys, Laura Rimell, Thierry Poibeau and Anna
                  Korhonen},
  booktitle =	 {Proceedings of the 24th International Conference on
                  Computational Linguistics},
  series =	 {COLING '12},
  month =	 {December},
  year =	 {2012},
  location =	 {Mumbai, India},
  annote =  {

Learns verb argument structures (subcategorization frames) via SVD
factorization of higher-dimensional matrices.

Verb subcategorization frame (SCF) induction involves identifying the
arguments of a verb lemma in a corpus, and generalizing about the frames
taken by the verb, where each frame includes a number of arguments and their
syntactic types. Consider e.g. sentence (1), where the verb show takes the
frame SUBJ-DOBJ-CCOMP (subject, direct object, and clausal complement).

(1) [Our October review]  comprehensively 	[shows]  	[you]
    SUBJ		  		  	VERB	    	DOBJ
	[what’s in store in next month’s magazine]
	CCOMP

Predicting the set of SCFs for a verb can be viewed as a multi-way
co-occurrence problem of a verb and its different arguments. One of the main
challenges is distinguishing arguments from adjuncts (e.g. temporal,
locative, or manner modifiers). Most SCF induction work to date considers only
the co-occurrences of verb lemmas with different grammatical relation types
(subject, object, prepositional phrase, etc.). Taking SCF acquisition to the
next level requires consideration of the lexical fillers of potential argument
slots for more accurate argument-adjunct discrimination.
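
A minimal sketch (toy data, not the paper's pipeline) of casting SCF
acquisition as a multi-way co-occurrence problem: build a verb x subject x
object count tensor and look at its dominant structure via an SVD of the
verb-mode unfolding.

    import numpy as np

    verbs, subjects, objects = ["show", "eat"], ["review", "dog"], ["you", "bone"]
    triples = [("show", "review", "you"), ("eat", "dog", "bone"), ("eat", "dog", "bone")]

    # 3-way co-occurrence tensor of (verb, subject, object) counts
    T = np.zeros((len(verbs), len(subjects), len(objects)))
    for v, s, o in triples:
        T[verbs.index(v), subjects.index(s), objects.index(o)] += 1

    # unfold along the verb mode (one row per verb, columns = subject/object pairs)
    unfolded = T.reshape(len(verbs), -1)
    U, S, Vt = np.linalg.svd(unfolded, full_matrices=False)
    print(S)    # singular values of the verb-mode unfolding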

--Selectional Preference--
Selectional preference (SP) induction involves predicting the likelihood of a
given lexical item occurring in an argument slot, and generalizing about the
lexical classes which occur in the slot, which may be dependent on the
SCF. In sentence (2), for example, the verb show takes the frame SUBJ-DOBJ,
and the direct object of show in this frame is likely to be inanimate.

(2) [Stalin], who must have been well informed through his network of spies,
    SUBJ
	[showed] 	[no emotion].
	VERB		DOBJ

[show]: likely prefers an inanimate DOBJ

?project? write?

},
  abstract = {

This paper introduces a novel method for joint unsupervised aquisition of
verb subcategorization frame (SCF) and selectional preference (SP)
information. Treating SCF and SP induction as a multi-way co-occurrence
problem, we use multi-way tensor factorization to cluster frequent verbs from
a large corpus according to their syntactic and semantic behaviour. The
method extends previous tensor factorization approaches by predicting whether
a syntactic argument is likely to occur with a verb lemma (SCF) as well as
which lexical items are likely to occur in the argument slot (SP), and
integrates a variety of lexical and syntactic features, including
co-occurrence information on grammatical relations not explicitly represented
in the SCFs. The SCF lexicon that emerges from the clusters achieves an
F-score of 68.7 against a gold standard, while the SP model achieves an
accuracy of 77.8 in a novel evaluation that considers all of a verb’s
arguments simultaneously.

[pdf]

}}




@inproceedings{wu-kashioka-12coling_factored-language-model-from-recurrent-NN,
 title =	 {Factored Language Model based on Recurrent Neural Network},
  author = 	 {Youzheng Wu and Xugang Lu and Hitoshi Yamamoto and Shigeki
	  Matsuda and Chiori Hori and Hideki Kashioka},
  booktitle =	 {Proceedings of the 24th International Conference on
                  Computational Linguistics},
  series =	 {COLING '12},
  month =	 {December},
  year =	 {2012},
  location =	 {Mumbai, India},
  annote = {

Words such as “prices”/“price” and “developed”/“developing” differ on the
surface; one way of matching them is stemming, which keeps the first part of
the surface form based on heuristics, but this works well only for mildly
agglutinative languages.

For all language types, this paper instead treats a word as a bundle of
factors: surface form, stem, lemma, POS, gender marker, case marker, number
marker, etc. Incorporating these factors into RNN-based language models
improves performance on most tasks.

?project?
IDEA: may be useful for indian languages, and possibly in context other than
RNN language models.
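
A minimal sketch (assumed factor inventory and sizes; not the authors' model)
of feeding a bundle of factor embeddings into a recurrent language model:

    import torch
    import torch.nn as nn

    class FactoredRNNLM(nn.Module):
        def __init__(self, factor_vocab_sizes, emb_dim=64, hidden=128):
            super().__init__()
            # one embedding table per factor (surface form, stem, POS, ...)
            self.embeds = nn.ModuleList([nn.Embedding(v, emb_dim) for v in factor_vocab_sizes])
            self.rnn = nn.RNN(emb_dim * len(factor_vocab_sizes), hidden, batch_first=True)
            self.out = nn.Linear(hidden, factor_vocab_sizes[0])  # predict the next surface word

        def forward(self, factors):
            # factors: (batch, time, n_factors) tensor of factor indices
            x = torch.cat([emb(factors[..., i]) for i, emb in enumerate(self.embeds)], dim=-1)
            h, _ = self.rnn(x)
            return self.out(h)            # logits over the surface vocabulary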

},
  abstract = {
Among various neural network language models (NNLMs), recurrent neural
network-based language models (RNNLMs) are very competitive in many
cases. Most current RNNLMs only use one single feature stream, i.e., surface
words. However, previous studies proved that language models with additional
linguistic information achieve better performance. In this study, we extend
RNNLM by explicitly integrating additional linguistic information, including
morphological, syntactic, or semantic factors. Our proposed RNNLM is called a
factored RNNLM that is expected to enhance RNNLMs. A number of experiments
are carried out that show the factored RNNLM improves the performance for all
considered tasks: consistent perplexity and word error rate (WER)
reductions. In the Penn Treebank corpus, the relative improvements over
n-gram LM and RNNLM are 29.0% and 13.0%, respectively. In the IWSLT-2011 TED
ASR test set, absolute WER reductions over RNNLM and n-gram LM reach 0.63 and
0.73 points.


[pdf]

}}



@inproceedings{socher-linCC-NgA-11_parsing-natural-scenes-w-RNNs,
 title={Parsing natural scenes and natural language with recursive neural networks},
  author={Socher, R. and Lin, C.C. and Ng, A.Y. and Manning, C.D.},
  booktitle={Proceedings of the 26th International Conference on Machine Learning (ICML)},
  volume={2},
  pages={7},
  year={2011},
  abstract = {

---

This paper basically introduces recursive neural networks for predicting
recursive structures in multiple modalities such as the case of natural scene
images or natural language sentences. This algorithm also helps to, aside
from identification, understand the interaction of different sections of
images to form a whole scene.

Basic Methodology

Images are divided into small regions to form several segments with
associated features (colour, texture, etc.) [1], [2]. These features are
mapped into a semantic space using a neural network. Using these semantic
region representations and an adjacency matrix as input, the RNN computes:

(i) a score that supports neighboring segments being merged to form larger
	segments;
(ii) the modified semantic representation and adjacency matrix for the new
	merged region;
(iii) the label of each node (object categories such as building or street,
	based on the training results).

The model is trained using max-margin estimators [3], [4] and a greedy
approach, so that the score is high when neighboring regions have the same
class label. After regions with the same object label are merged, neighboring
objects are merged to form the full scene. These merging decisions define a
tree structure in which each node carries a label and a score, and higher
nodes represent increasingly larger elements of the image. The tree with the
highest cumulative score is taken as the representation of the input image.
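
A toy sketch of the greedy merging loop described above, with random weights
standing in for the learned max-margin parameters (the dimensions, names and
tanh composition are assumptions made only to show the control flow):

  # Toy sketch (not the paper's implementation) of greedy region merging:
  # adjacent region vectors are composed by a small network and the pair
  # with the highest score is merged first.
  import numpy as np

  rng = np.random.default_rng(1)
  d = 16                                   # semantic space dimensionality
  W = rng.normal(size=(d, 2 * d)) * 0.1    # composition weights (assumed)
  w_score = rng.normal(size=d) * 0.1       # scoring weights (assumed)

  def merge(a, b):
      """Compose two region vectors into a parent vector and score the merge."""
      parent = np.tanh(W @ np.concatenate([a, b]))
      return parent, float(w_score @ parent)

  regions = {i: rng.normal(size=d) for i in range(5)}   # one vector per segment
  adjacent = {(0, 1), (1, 2), (2, 3), (3, 4)}           # segment adjacency

  while len(regions) > 1:
      (i, j), (parent, s) = max(
          (((a, b), merge(regions[a], regions[b])) for (a, b) in adjacent),
          key=lambda t: t[1][1])
      new_id = max(regions) + 1
      regions[new_id] = parent
      rewired = set()
      for (a, b) in adjacent:
          if (a, b) == (i, j):
              continue                      # this edge is consumed by the merge
          if a in (i, j) or b in (i, j):
              k = b if a in (i, j) else a
              rewired.add((min(new_id, k), max(new_id, k)))
          else:
              rewired.add((a, b))
      adjacent = rewired
      del regions[i], regions[j]
      print(f"merged {i} and {j} into {new_id} (score {s:.3f})")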

The same algorithm is used to parse natural language sentences. Words are
used in place of image segments[5][6]. They are merged into phrases in a
syntactically and semantically meaningful order. The class labels are phrase
types such as noun phrase (NP) or verb phrase (VP).

Contribution

This paper introduces an RNN that successfully merges image segments or words
using recursively learned parse representations. It is the first recursive
learning method to achieve state-of-the-art results on segmentation and
annotation of complex scenes. The RNN architecture outperforms methods based
on conditional random fields and combinations of other methods. With a
success rate of 78.1% in scene understanding, it beats previously reported
results (at most 77.5%) on the same Stanford data. With an accuracy of 88.1%
for scene categorization, it outperforms Gist descriptors (Oliva & Torralba,
2001), which obtain only 84.0%. The algorithm is general and can also parse
natural language sentences, obtaining competitive performance with the widely
used Berkeley parser on the Wall Street Journal dataset.

References 
Original Paper:
Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Richard Socher Cliff Chiung-Yu Lin Andrew Y. Ng Christopher D. Manning
[1] Comaniciu, D. and Meer, P. Mean shift: a robust approach toward feature space analysis. IEEE PAMI, 24(5):603–619, May 2002.
[2] Gould, S., Fulton, R., and Koller, D. Decomposing a Scene into Geometric and Semantically Consistent Regions. In ICCV, 2009.
[3] Taskar, B., Klein, D., Collins, M., Koller, D., and Manning, C. Max-margin parsing. In EMNLP, 2004.
[4] Manning, C. D. and Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 1999.
[5] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. JMLR, 3, 2003.
[6] Collobert, R. and Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008.
 

---

Recursive structure is commonly found in the inputs of different modalities
such as natural scene images or natural language sentences.  Discovering this
recursive structure helps us to not only identify the units that an image or
sentence contains but also how they interact to form a whole. We introduce a
max-margin structure prediction architecture based on recursive neural
networks that can successfully recover such structure both in complex scene
images as well as sentences. The same algorithm can be used both to provide a
competitive syntactic parser for natural language sentences from the Penn
Treebank and to outperform alternative approaches for semantic scene
segmentation, annotation and classification.  For segmentation and annotation
our algorithm obtains a new level of state-of-the-art performance on the
Stanford background dataset (78.1%). The features from the image parse tree
outperform Gist descriptors for scene classification by 4%.


[pdf]

}}



@inproceedings{wang-mahadevan-11ijc_heterogeneous-domain-manifold-alignment,
 title={Heterogeneous domain adaptation using manifold alignment},
  author={Chang Wang and Sridhar Mahadevan},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1541--1546},
  year={2011},
  annote = {

domain adaptation: use knowledge from domains 1..K to solve a similar problem
in domain K+1, which may not have comparable resources.
E.g., given two labelled collections of documents, in English and Italian, can
we label a collection of Hindi documents?  If the feature spaces of the
documents are not the same, this is very difficult.  Here the idea is to map
the different domains into a latent space where every class is a manifold,
and then to align the manifolds by structure to capture the similarities.

A similarity matrix is defined so that two points are highly similar
only when the local geometries of the two data sets are roughly the same.
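
A minimal sketch of that idea under strong simplifying assumptions (linear
maps learned by least squares onto shared class codes, synthetic data), not
the paper's algorithm:

  # Sketch: map two domains with different feature spaces into a shared
  # latent space by regressing each domain's features onto shared class
  # codes, then compare instances across domains in that latent space.
  import numpy as np

  rng = np.random.default_rng(2)
  n_classes = 3

  def make_domain(n, dim):
      y = rng.integers(0, n_classes, size=n)
      basis = rng.normal(size=(n_classes, dim))          # class structure
      X = basis[y] + 0.1 * rng.normal(size=(n, dim))
      return X, y

  Xa, ya = make_domain(60, 20)    # e.g. English documents, 20 features
  Xb, yb = make_domain(50, 35)    # e.g. Italian documents, 35 features

  def latent_map(X, y):
      """Least-squares map from domain features to one-hot class codes."""
      Y = np.eye(n_classes)[y]
      W, *_ = np.linalg.lstsq(X, Y, rcond=None)
      return W

  Wa, Wb = latent_map(Xa, ya), latent_map(Xb, yb)
  Za, Zb = Xa @ Wa, Xb @ Wb       # both domains now live in the same space

  # label one instance from domain B by its nearest neighbour in domain A
  query = Zb[0]
  nearest = np.argmin(np.linalg.norm(Za - query, axis=1))
  print("predicted class:", ya[nearest], "true class:", yb[0])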

abstract:
We propose a manifold alignment based approach for heterogeneous domain
adaptation. A key aspect of this approach is to construct mappings to link
different feature spaces in order to transfer knowledge across domains. The
new approach can reuse labeled data from multiple source domains in a target
domain even in the case when the input domains do not share any common
features or instances. As a pre-processing step, our approach can also be
combined with existing domain adaptation approaches to learn a common feature
space for all input domains. This paper extends existing manifold alignment
approaches by making use of labels rather than correspondences to align the
manifolds. This extension significantly broadens the application scope of
manifold alignment, since the correspondence relationship required by
existing alignment approaches is hard to obtain in many applications.

(see also the AAAI-12 paper by Vu & Mahadevan:
Manifold Warping: Manifold Alignment over Time)


[pdf]

}}

LANGUAGE GENERATION




@inproceedings{ontanon-zhu-11ijc_domain-knowlwedge-story-generation,
 title={On the role of domain knowledge in analogy-based story generation},
  author={Santiago Onta{\~n}{\'o}n and Jichen Zhu},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={1717--1722},
  year={2011},
  abstract = {


---2013 review Niraj Kumar ~nirajkr/cs365/hw2  (pls rename to hw2.pdf)

Using AI techniques to algorithmically analyze, understand and generate
stories is a complex field, and domain knowledge is very important for such a
task. Talespin was the first story generation system, in which more and more
domain knowledge was added to fix the stories it created. The authors of this
paper use an analogy-based story generation system, Riu.

Story generation explores the ways in which stories can be generated
algorithmically; it involves natural language selection, character modeling,
common sense and reasoning. The focus of the paper is computational
analogy-based story generation: the system generates stories about a target
domain T by drawing an analogy with a source domain S. The theoretical
framework of the paper is based on Richter's knowledge container framework
from case-based reasoning (CBR). There are four knowledge containers in this
framework, which Riu uses.

(i) Vocabulary: contains the important concepts and relations used to
represent primitive notions. A large amount of domain-specific knowledge is
involved here.

(ii) Similarity Measure: assigns to each pair of stories a number in the
range [0,1] indicating how similar they are; given two keywords, it counts
the number of related keywords in WordNet.

(iv) Story Repository (Case Base): the repository holding the collection of
story scenes. The quality of generated stories increases as the amount of
data increases.

The Riu system is a text-based interactive system which uses analogy to
generate stories about a robot character, Ales, who has lost his memory. Ales
imagines the consequences of each action using his previously stored
information.

The study identifies areas where increasing domain knowledge may lead to
significant improvements. Authoring more stories can increase the variety of
the stories but does little to improve their quality.

---

Computational narrative is a complex and interesting domain for exploring AI
techniques that algorithmically analyze, understand, and most importantly,
generate stories. This paper studies the importance of domain knowledge in
story generation, and particularly in analogy-based story generation
(ASG). Based on the construct of knowledge container in case-based reasoning,
we present a theoretical framework for incorporating domain knowledge in
ASG. We complement the framework with empirical results in our existing
system Riu.


[pdf]

}}



@article{wang-li-ogihara-12_generating-storylines-picture-text,
 title={Generating Pictorial Storylines via Minimum-Weight Connected Dominating Set Approximation in Multi-View Graphs},
  author={Dingding Wang and Tao Li and Mitsunori Ogihara},
  journal={Proceedings of AAAI 2012},
  year={2012},
  abstract={

---review 2013 Vaibhav 10780


This paper offers a well-defined solution to the rather subjective problem of
generating pictorial storylines from texts, pictures and queries. Existing
research has produced fairly accurate methods for summarizing text alone or
for building a timeline out of a set of texts.

The paper takes this a step further: the summary is based on both texts and
images and is presented as a timeline. This is more intuitive, since images
and text together have more impact than text alone. The algorithm takes as
input relevant pieces of text and images along with their timestamps. This
information is represented as a multi-view graph containing both directed and
undirected edges. The nodes are images and their related texts; an undirected
edge between two nodes indicates text similarity between them, and a directed
edge represents their temporal relationship. Each vertex is assigned a weight
depending on its relevance to the query. Next, a minimum-weight dominating
set is found among the query-relevant objects. As stated in the paper, “the
Minimum-Weight Dominating Set Problem (MWDS) is the problem of finding, given
a vertex-weighted undirected graph G, from all dominating sets of G the one
whose total vertex weight is the smallest.” An approximate greedy algorithm
is described in the paper (a sketch of such a greedy approximation follows
below). The storyline is then generated by connecting these dominating
objects via a directed Steiner tree.

Both of the above problems are NP-hard.
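
A hedged sketch of a greedy approximation for the (unconnected)
minimum-weight dominating set, to illustrate the kind of algorithm the paper
uses; the graph, weights and tie-breaking below are illustrative assumptions:

  # Greedy minimum-weight dominating set approximation (plain dominating
  # set; the paper additionally requires connectivity).  Vertices are
  # storyline objects; lower weight = more query-relevant here, so cheap
  # vertices that cover many undominated neighbours are preferred.
  def greedy_mwds(vertices, weights, neighbours):
      """vertices: iterable of ids; weights: dict id->float;
      neighbours: dict id->set of adjacent ids (undirected)."""
      undominated = set(vertices)
      chosen = []
      while undominated:
          def gain(v):
              covered = ({v} | neighbours[v]) & undominated
              return len(covered) / weights[v] if covered else 0.0
          best = max(vertices, key=gain)
          if gain(best) == 0.0:
              break                      # remaining nodes are isolated
          chosen.append(best)
          undominated -= {best} | neighbours[best]
      return chosen

  # tiny example graph
  neigh = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
  w = {1: 2.0, 2: 1.0, 3: 1.5, 4: 1.0, 5: 3.0}
  print(greedy_mwds(neigh.keys(), w, neigh))   # prints [2, 4], which dominates all nodes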

The paper then compares the effectiveness of the results produced by this
algorithm against other available techniques. The ROUGE toolkit is used to
compare the effectiveness of the summaries. Whether for text summarization or
storyline (graph) generation, the MWDS-based method outperforms the other
clustering algorithms. Since the problem is clearly subjective, user studies
were also conducted, and they confirmed the same result.

This paper introduces a novel framework for generating pictorial storylines
for given topics from text and image data on the Internet. Unlike traditional
text summarization and timeline generation systems, the proposed framework
combines text and image analysis and delivers a storyline containing textual,
pictorial, and structural information to provide a sketch of the topic
evolution. A key idea in the framework is the use of an approximate solution
for the dominating set problem. Given a collection of topic-related objects
consisting of images and their text descriptions, a weighted multi-view graph
is first constructed to capture the contextual and temporal relationships
among these objects. Then the objects are selected by solving the
minimum-weighted connected dominating set problem defined on this
graph. Comprehensive experiments on real-world data sets demonstrate the
effectiveness of the proposed framework.

[pdf]

}}

GRAMMAR EVOLUTION




@inproceedings{beuls-hofer-11ijc_emergence-of-grammatical-agreement-in-language-games,
 title={Simulating the emergence of grammatical agreement in multi-agent language games},
  author={Beuls, K. and H{\"o}fer, S.},
  booktitle={Proceedings of the 22nd IJCAI Volume One},
  pages={61--66},
  year={2011},
  annote = {

Agreement: when a possessive (his, her) or a verb / adjective (as in Hindi)
"agrees" in gender, number, person or some other feature with the "head" of a
phrase (e.g. the subject).  E.g., John left HIS room.  He GOES.  They GO.

This paper uses game-theoretic models: in a language population where
small changes can occur between generations, it shows that the evolution of
agreement results from the drive to reduce processing effort.  ??project??

---review 2013 Swati

Why do natural languages use complex grammatical concepts? We believe that
many of these concepts emerged in order to guide linguistic processing and to
reduce processing effort.

In this work we investigate the concept of grammatical agreement. Grammatical
agreement has become an essential part that guides linguistic processing. We
hypothesize that agreement emerged in natural languages in order to
facilitate the recognition of phrasal constituents, and hence, reduce
processing effort.

In order to support our hypothesis we develop an experiment consisting of
artificial agents playing so-called language games. We show that the agents
are able to bootstrap a simple agreement system which enables them to
recognize phrasal constituents with much less effort.

The most important concepts discussed here are language games and their
results, the constituent marking strategy, and various feature selection
methods.

The language game model was first proposed by Steels to study the dynamics of
language interaction between agents, which could be humans or robots. The
result of the language games is that if agents are allowed to invent and
learn agreement markers, the processing effort decreases over time; otherwise
it stays high.

In the constituent marking strategy, the speaker attaches markers to
linguistic units. Units that belong together are marked by the same
marker. Each marker represents a semantic feature, such as gender, number,
determination, etc. The agent identifies discriminating grammatical features
of the phrases that correspond to its perception. It then associates the
markers with the selected features and attaches each marker to all members of
the matching phrase.
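
A toy sketch of this marking step under assumed features and invented marker
strings (not the authors' implementation):

  # Toy sketch of the constituent marking strategy: the speaker picks one
  # discriminating feature per phrase and attaches that feature's marker to
  # every word of the phrase, so the hearer can group words back into
  # constituents with little search effort.
  phrases = [
      {"words": ["the", "red", "ball"],
       "features": {"gender": "n", "number": "sg"}},
      {"words": ["two", "green", "cubes"],
       "features": {"gender": "n", "number": "pl"}},
  ]

  markers = {("number", "sg"): "-ka", ("number", "pl"): "-ki"}  # invented markers

  def mark(phrase, feature):
      m = markers[(feature, phrase["features"][feature])]
      return [w + m for w in phrase["words"]]

  # "number" discriminates the two phrases here, so the speaker selects it
  utterance = [w for p in phrases for w in mark(p, "number")]
  print(utterance)
  # ['the-ka', 'red-ka', 'ball-ka', 'two-ki', 'green-ki', 'cubes-ki']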

Sometimes there is no unique discriminating feature, so the following
feature selection methods are compared:

	- Single feature selection (both agents select same feature), 
	- Random feature selection (randomly select) and 
	- Emergent feature selection method (preference of feature selected
		in past). 

The paper focuses on semantic agreement features; the same approach could be
applied to syntactic features and the results recorded. For the sake of
simplicity, a few assumptions were made in the implementation of the language
games.

First, a one-to-one feature-form mapping is used, which forces the selection
of a single feature, but this could be extended: a marker string could be
mapped to a feature matrix, i.e. each marker could be associated with several
features. This extension would open a new discussion on the emergence of
feature matrices.

Second, the agents always invent a marker immediately, as soon as they
experience cognitive effort due to indistinguishable referents. A threshold
for invention could be introduced: agents would only invent a marker for a
certain feature if they frequently face cognitive effort without having a
marker for that feature.

In the future, the complexity of agreement systems could be investigated
further by extending the experiments and verifying the results on a larger
population with more features.

Also, only two agents with the same perception of the scene were used; with
two agents that perceive the scene differently, communicative success is no
longer guaranteed.

---

abstract:
Grammatical agreement is present in many of the world's languages today and
has become an essential feature that guides linguistic processing. When two
words in a sentence are said to "agree," this means that they share certain
features such as "gender," "number," "person" or others. The primary
hypothesis of this paper is that marking agreement within one linguistic
phrase reduces processing effort as phrasal constituents can more easily be
recognized. The drive to reduce processing effort introduces the rise of
agreement marking in a population of multiple agents by means of an
incrementally aligned mapping between the most discriminatory features of a
particular linguistic unit and their associative markers. A series of
experiments compare feature selection methods for one-to-one agreement
mappings, and show how an agreement system can be bootstrapped.


[pdf]

}}

IMAGE / VIDEO DESCRIPTION




@article{barbu-siskind-12_video-in-sentences-out,
 title={Video In Sentences Out},
  author={Barbu, A. and Bridge, A. and Burchill, Z. and Coroian, D. and Dickinson, S. and Fidler, S. and Michaux, A. and Mussman, S. and Narayanaswamy, S. and Salvi, D. and others},
  journal={arXiv preprint arXiv:1204.2742},
  year={2012}
  annote = {

Given a video - generate a narrative for it.

#### GET body-posture codebook? 

---2013 Review Nitish Gupta (10461) ~gnitish/cs365/hw2/review.pdf

Introduction: 

The authors aim at presenting a system that produces sentential descriptions
of short video clips. The sentences describe who did what to whom, and where
and how they did it. This system describes observed actions, the participant
objects, their properties and spatial relations in the sentential
description.

The sentential description has the following parts:
1. Verb: describes the observed action.
2. Noun Phrases: describe the participants, who fall into two classes:
 *  Agent: the doer of the action
 *  Patient: the affected object

3. Adjective Modifiers: describe properties of objects.
4. Prepositional Phrases: describe spatial relations.

[FIG]
Vocabulary used to generate sentential description

--Video Corpus Used--

The video corpus used is produced by DARPA for the Mind’s Eye program. A total
of 3480 videos are used for training and 749 for testing.

--Overall System Architecture--

The system is based on 4 components: 

1. Object Detection and Tracking

Detection-based tracking is employed: an object detector is applied to each
frame to yield a set of candidate detections, which are composed into tracks
by selecting a single candidate detection from each frame so as to maximize
the temporal coherency of the track. The Felzenszwalb et al. [2] detectors
are used.  The system is trained for 25 object classes, which cover around
90% of the event participants.

2. Body-Posture Codebook

Body posture information is derived using part structure produced by the
Felzenszwalb et al. [2] detectors. A codebook is trained by running each
pose-specific person detector on annotated human-pose samples.

The figure below shows sample clusters from the system's codebook.

3. Event Classification

The system's tracker produces one or more tracks per object class. One track
is taken to designate the ‘agent’, and another track (if present) to
designate the ‘patient’. During training the track-to-role mapping is done
manually; during testing it is done automatically by examining all such
mappings and selecting the one with maximum likelihood.

For each participant in the video a set of features are stored such as ‘x and
y coordinates’, magnitude and direction of velocity and acceleration, object
class, body-posture codebook index.

For each pair of event participants, the stored feature is the distance
between the participants.

With this information Hidden Markov Models are used to classify actions
taking place in the video.

4. Generating Sentences

The sentences are generated from detected action class together with
knowledge of the participants using the following template.

X and Y denote subject and object noun phrases, and the categories Adv.,
PPendo, and PPexo denote adverbs and prepositional phrase adjuncts to
describe the subject motion.

The verb-phrase template indicates whether an adverb is allowed or not. The
velocities of different participants are compared to choose adverbs like
‘quickly’ or ‘slowly’. Prepositional-phrase adjuncts are used to describe
motion direction. Other properties like color, size and height are decided by
similar methods (a toy sketch of the template filling follows below).
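
An illustrative sketch of such template filling; the thresholds, vocabulary
and field names are assumptions, not the authors' template:

  # Fill a simple "X verb Y" sentence template from a detected event class
  # and participant track features (speed, velocity direction, color, class).
  def describe(event, agent, patient=None):
      np_agent = f"the {agent['color']} {agent['cls']}"
      adverb = "quickly" if agent["speed"] > 2.0 else "slowly"
      direction = "leftward" if agent["vx"] < 0 else "rightward"
      sentence = f"{np_agent} {adverb} {event}"
      if patient is not None:
          sentence += f" the {patient['color']} {patient['cls']}"
      return sentence.capitalize() + f" {direction}."

  agent = {"cls": "person", "color": "red", "speed": 3.1, "vx": -1.2}
  patient = {"cls": "trash can", "color": "blue"}
  print(describe("approached", agent, patient))
  # -> The red person quickly approached the blue trash can leftward.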

Experimental Results

For each video, we generated sentences corresponding to the three most-likely
action classes. For 49.4% (370/749) of the videos at least one of the three
generated sentences was deemed true and for 18.4% (138/749) of the videos at
least one of the three generated sentences was deemed salient.

References

[1] Barbu, A. and Bridge, A. and others, Video In Sentences Out. (Original
Paper chosen by me). 

[2] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object
detection with deformable part models. In CVPR, 2010a.

[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object
detection with discriminatively trained part based models. PAMI, 32(9),
2010b.

---
abstract:
We present a system that produces sentential descriptions of video: who did
what to whom, and where and how they did it. Action class is rendered as a
verb, participant objects as noun phrases, properties of those objects as
adjectival modifiers in those noun phrases, spatial relations between those
participants as prepositional phrases, and characteristics of the event as
prepositional-phrase adjuncts and adverbial modifiers. Extracting the
information needed to render these linguistic entities requires an approach
to event recognition that recovers object tracks, the track-to-role
assignments, and changing body posture.


[pdf]

}}



@inproceedings{bergsma-vanDurme-11ijc_learning-bilingual-lexicons-using-image-similarity,
 title={Learning bilingual lexicons using the visual similarity of labeled web images},
  author={Shane Bergsma and Benjamin Van Durme},
  booktitle={Proceedings of the 22nd IJCAI Volume Three},
  pages={1764--1769},
  year={2011},
  abstract = {

very innovative -
largely vision-based - needs to use some kind of image similarity map
(e.g. bag of words) - to map labels found in documents in different
languages.

VISION? 

---review 2013 Ujjwal K Singh

The paper deals with the problem of word-to-word translation from one
language into another, since for most language pairs parallel data for
machine translation is unavailable. For this purpose, it uses images to find
possible translations of a word: the sets of images labeled with words in two
languages (the images themselves are not identical) are compared to obtain
good translations. The method could also be used to extend the reach of image
search engines.  “use these explicit, monolingual, image-to-word connections
to successfully learn implicit, bilingual, word-to-word translations”

Approach: For each word, e.g. candle (English) and vela (Spanish), Google
image search is used to acquire a set of 20 images. Visual features are
extracted from these sets as color histograms and SIFT keypoints (visual
similarity), and Normalized Edit Distance is used as orthographic similarity.

A color histogram partitions the color space and counts the image pixels
(R, G, B) that fall in each partition; a 4096-dimensional vector space is
used in the experiments.

SIFT (Scale Invariant Feature Transform) keypoints are multidimensional
feature vectors that are largely invariant to scaling, rotation, illumination
and distortion. These features are clustered into K centroids using the
k-means algorithm. In signal-processing terminology, each cluster centroid is
a codeword in a K-entry SIFT codebook, and the keypoints of a particular
image are quantized against this codebook.

“Each dimension in the resulting feature vector corresponds to a codeword;
each value is the count of the number of keypoints mapping to that word.”

NED: the (normalized) edit distance, found by dynamic programming as the
number of insertions, deletions and substitutions required to convert one
string into another.

[EQNS]

The two formulas above are used to score the visual similarity of two words
from their associated image sets, and this similarity score is then used to
rank translation pairs for bilingual lexicon induction.
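
A sketch of the AvgMax scoring of two image sets, with random vectors
standing in for the SIFT-codeword or color histograms:

  # AvgMax image-set similarity: for each image of word e, take its best
  # cosine match among the images of word f, then average the maxima.
  import numpy as np

  rng = np.random.default_rng(3)

  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

  def avg_max(set_e, set_f):
      return float(np.mean([max(cosine(x, y) for y in set_f) for x in set_e]))

  candle_en = rng.random((20, 100))   # 20 images x 100-dim codeword counts
  vela_es   = rng.random((20, 100))
  print(f"AvgMax similarity: {avg_max(candle_en, vela_es):.3f}")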

The paper lists two ways to create lexicons of physical objects, which are
later used in the experiments: a precise but low-coverage pattern-based
approach, and a higher-coverage but noisier approach based on distributional
similarity with a seed lexicon.

In pattern-based lexicon creation, words are ranked by their conditional
probability of co-occurring with the pattern; words whose occurrence in the
corpus is below 50% are filtered out, and the first 500 are taken for the
experiment. One remaining problem is matching words with their visual
features.

In the latter approach, a larger list is created of words that occur in the
same contexts as the seed list of physical objects; contextual similarity is
used to rank the words (unigrams).

“We exploit the availability of large corpora in each
language to rank a list of unigrams by their contextual similarity with the
seeds. Contextual similarity is defined as the cosine similarity between
context vectors, where each vector gives the counts of words to the left and
right of the target unigram.”

Alternatively, a locality-sensitive hashing algorithm can be used to
approximate the cosine computation by building low-dimensional bit
signatures; ranking is done by checking similarity with the ten most similar
lexicon entries, and the top 20,000 words are taken for the experiment.

In Experiment 1, using pattern-based lexicon creation, the data are used to
evaluate AvgMax while varying the number of images in each image set (20) and
the SIFT codebook dimensionality. MRR is the mean reciprocal rank of the
correct translation, where tr(i) is the rank of the target lexicon entry with
index i. The goal of this experiment is shown in an accompanying figure.

The table below shows the results of the experiment as percentage
performance; AvgMax performs better than MaxMax. AvgMax increases
quadratically with the number of images, as seen in the graph. Increasing the
number of codewords yields a more specific visual representation, excluding
false-positive matches at the cost of more general similarities. The authors
also conclude that using visual similarity together with orthographic
similarity gives better performance.

AvgMax MRR  :  36.0
MaxMax MRR  :  31.5

In Experiment 2, bilingual lexicons are created using distributional
similarity over 20,000-word lists, each word having an image set of 20
images. Only SIFT and NED are used for this experiment. From the graph it can
be seen that SIFT alone does not perform well, but NED alone improves
performance drastically, and SIFT + NED increases it further. The method also
translates low-frequency nouns such as 'fishhook' and 'rosary' correctly,
which cannot be found in much parallel text.

In conclusion, bilingual lexicon induction performance can be improved using
labeled images; the benefit should be even larger for unrelated language
pairs; and visual features combined with orthographic features provide a
substantial gain over orthographic features alone.

---

abstract:
Speakers of many different languages use the Internet. A common activity
among these users is uploading images and associating these images with words
(in their own language) as captions, filenames, or surrounding text. We use
these explicit, monolingual, image-to-word connections to successfully learn
implicit, bilingual, word-to-word translations. Bilingual pairs of words are
proposed as translations if their corresponding images have similar visual
features. We generate bilingual lexicons in 15 language pairs, focusing on
words that have been automatically identified as physical objects. The use of
visual similarity substantially improves performance over standard approaches
based on string similarity: for generated lexicons with 1000 translations,
including visual information leads to an absolute improvement in accuracy of
8-12% over string edit distance alone.


[pdf]

}}



@inproceedings{narayanaswamy-siskind-11_visual-language-for-estimating-object-pose,
 title={A visual language model for estimating object pose and structure in a generative visual domain},
  author={Narayanaswamy, S. and Barbu, A. and Siskind, J.M.},
  booktitle={Robotics and Automation (ICRA), 2011 IEEE International Conference on},
  pages={4854--4860},
  year={2011},
  annote = {

Vision is guided by "grammar" of parts that join to become assemblies.

uses a Lincoln Logs grammar --> known how they may orient. 

---2013 review Mohit Agarwal ~mohita/cs365/hw2/Paper%20Review.pdf

In this paper Siddharth Narayanaswamy, Andrei Barbu, and Jeffrey Mark Siskind
propose a visual language model for object recognition, by analogy to models
of human language. The approach determines the precise pose, the assembly
structure, and the 3D pose of each component of Lincoln Log assemblies.

The proposed visual language model is more complex than human language
models: it is context sensitive and it deals with occlusion. Because it is
context sensitive, unlike the context-free grammars of human language, its
symbolic structure takes the form of graphs with cycles and is formulated as
a stochastic constraint satisfaction problem.

The model is illustrated using Lincoln Log assemblies. Combining Lincoln Logs
in different ways yields a very large number of assemblies, i.e. a
combinatorially large class of objects. The approach uses the fact that
scenes and objects are represented as descriptions involving parts and
spatial relations, and focuses on generative domains (which can generate a
large class of distinct structures from a small class of components).

VALID LINCOLN LOG STRUCTURES: those in which the notches are aligned and the
medial axes are parallel to the work surface. Logs are stacked so that the
even-numbered layers are mutually parallel, as are the odd-numbered layers,
and the projections of even and odd layers onto the work surface intersect at
90 degrees. These projected lines form the grid. The grid coordinate system
is then transformed into the camera coordinate system.

STRUCTURE POSE ESTIMATION: First, the foreground is separated from the
background using masking. For simplicity, only three degrees of freedom are
considered for the structure pose: horizontal translation on the work surface
and yaw around the vertical axis.

The pose p is estimated by minimizing the distance between the set of
projected grid lines Lg and the set of image edge line segments Li. This pose
estimate is then refined by maximizing the coincidence between the projected
grid lines and the set of image edge points Pi. The pose estimation method
converges quickly but is error prone, which the refinement corrects; the
refinement cannot be used on its own, as it only works from a close initial
estimate, which the pose estimation step provides.

LOG OCCUPANCY DETERMINATION ALONG WITH EVIDENCE: At each grid position (point
at notch center) log occupancy is determined using image evidence. Lincoln
Logs generate two image features log ends (elliptical projection of circular
log ends) and log segments (line segments from projection of cylindrical
walls). For each grid position q, random variables Zq+, Zq-, encode the
presence and absence of log ends and similarly Zqu, Zqv, Zqw encode the
presence/absence of log segments.

For each pose p, the parameters of the manifested ellipses are obtained by a
least-squares fit of 20 equally spaced 3D points.  Similarly, the parameters
of the line segments manifested by Zqu, Zqv, Zqw are obtained. Thus evidence
for log ends and log segments is determined, and the evidence is mapped to
priors for the log-segment and log-end evidence functions respectively, on a
set of 30 images.

LINCOLN LOG GRAMMAR: A Lincoln Log grammar is formulated: a set of
constraints and rules governing the assembly of Lincoln Log structures,
analogous to the grammar of a human language.

STRUCTURE ESTIMATION: Structure is estimated by placing a uniform prior over
the Zq, marginalizing out the random variables that correspond to log ends
and log segments, and conditioning this marginal distribution on the language
model. Finally, the assignment to the random variables Zq that maximizes this
conditional marginal probability is computed:

[EQN]
 
OCCLUSION: If we know that a log end or log segment is occluded then we
ignore all evidence for it from the image, giving it chance probability of
being occupied. With this, the grammar can often fill in the correct values
of occluded random variables for both log ends and log segments, and thus
determine the correct value for an occluded Zq.
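
A toy, assumption-laden illustration of this idea (not the paper's
inference): occluded positions get chance probability, and two hand-written
constraints stand in for the grammar, yet the constraints still pin down the
occluded cell:

  # Brute-force pick of the most probable grid occupancy that satisfies
  # simple "grammar" constraints; occluded evidence is set to chance (0.5).
  from itertools import product

  # two layers x three grid positions; evidence = P(occupied | image)
  evidence = {(0, 0): 0.9, (0, 1): 0.8, (0, 2): 0.1,
              (1, 0): 0.7, (1, 1): None, (1, 2): 0.2}   # (1, 1) is occluded
  positions = list(evidence)

  def prob(assignment):
      p = 1.0
      for q, occ in zip(positions, assignment):
          e = 0.5 if evidence[q] is None else evidence[q]  # occluded -> chance
          p *= e if occ else (1.0 - e)
      return p

  def valid(assignment):
      occ = dict(zip(positions, assignment))
      # a log in the upper layer must rest on an occupied lower position
      supported = all(not occ[(1, c)] or occ[(0, c)] for c in range(3))
      # logs span at least two adjacent notches: no isolated upper-layer cell
      no_isolated = all(not occ[(1, c)]
                        or any(occ.get((1, c + d), 0) for d in (-1, 1))
                        for c in range(3))
      return supported and no_isolated

  best = max((a for a in product([0, 1], repeat=6) if valid(a)), key=prob)
  print(dict(zip(positions, best)))
  # (1, 1) is inferred occupied: (1, 0) is well supported by the evidence
  # and the "two-notch" constraint forces its neighbour to be occupied too.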

EXPERIMENTAL RESULTS: The approach is applied to 160 images of 32 distinct
Lincoln Log structures. After foreground/background separation, pose and
structure estimation, the proposed method determined the correct component
type (Zq) for most occluded rows in the assemblies. The role of the grammar
in the accuracy of structure estimation is analyzed further. Pose and
structure estimation were found to be sufficiently robust for robotic
manipulation.

Disclaimer: images and text used in this review are taken from no external
source other than the paper itself.

---

Abstract—We present a generative domain of visual objects by analogy to the
generative nature of human language.  Just as small inventories of phonemes
and words combine in a grammatical fashion to yield myriad valid words and
utterances, a small inventory of physical parts combine in a grammatical
fashion to yield myriad valid assemblies. We apply the notion of a language
model from speech recognition to this visual domain to similarly improve the
performance of the recognition process over what would be possible by only
applying recognizers to the components. Unlike the context-free models for
human language, our visual language models are context sensitive and
formulated as stochastic constraint-satisfaction problems. And unlike the
situation for human language where all components are observable, our methods
deal with occlusion, successfully recovering object structure despite
unobservable components. We demonstrate our system with an integrated robotic
system for disassembling structures that performs whole-scene reconstruction
consistent with a language model in the presence of noisy feature detectors.


[pdf]

}}



@inproceedings{luW-zettlemoyer-08_generative-parsing-NL-to-meaning,
 title={A generative model for parsing natural language to meaning representations},
  author={Lu, W. and Ng, H.T. and Lee, W.S. and Zettlemoyer, L.S.},
  booktitle={Proceedings of the Conference on Empirical Methods in Natural Language Processing},
  pages={783--792},
  year={2008},
  annote = {

How many states do not have rivers ? : 8 words incl punctuation
  = all states - state(x) & located-river(x)

Lu et al. (2008) (Lu08) developed a generative model that builds a single
hybrid tree of words, syntax and meaning representation. These algorithms are
all language independent but representation specific.

ABSTRACT:
In this paper, we present an algorithm for learning a generative model of
natural language sentences together with their formal meaning representations
with hierarchical structures. The model is applied to the task of mapping
sentences to hierarchical representations of their underlying meaning. We
introduce dynamic programming techniques for efficient training and
decoding. In experiments, we demonstrate that the model, when coupled with a
discriminative reranking technique, achieves state-of-the-art performance
when tested on two publicly available corpora.  The generative model degrades
robustly when presented with instances that are different from those seen in
training. This allows a notable improvement in recall compared to previous
models.


[pdf]

}}

TEXT / DOCUMENT DATA MINING


@inproceedings{shahaf-guestrin-10_connecting-dots-between-news,
 title={Connecting the dots between news articles},
  annote = {
(see also the full entry shahaf-guestrin-11ijc_connecting-between-news-articles below)


2013 review ankit modi ~amodi/cs365/hw2/Research_Paper_Summary.pdf

This paper aims at fighting information overload by providing a structured,
easy way to navigate between topics. Given two news articles, the described
system automatically finds a coherent chain linking them together. The
concepts of Coherence and Influence are used to do so: Coherence measures the
quality of a chain in terms of word activation patterns, while Influence
measures the role of a word in connecting two articles.

Given a set of chronologically ordered articles/documents d_1, d_2, ..., d_n
and a set of words w_1, w_2, ..., w_m:

	Coherence(d_1, ..., d_n) = min_{i=1..n-1}  Σ_w  Influence(d_i, d_{i+1} | w)

First, a bipartite directed graph G = (V, E) is constructed. The vertices
V = V_D ∪ V_W correspond to documents and words. For every word w in document
d, both edges (w, d) and (d, w) are added. Edge weights (TF-IDF weights may
be used) represent the strength of correlation between a word and a
document. This graph is used to define Influence. The variable
node-active_{w,i} indicates the activation level of w during the transition
from d_i to d_{i+1}, and words-init_{w,i} indicates its initialization level
during that transition.

A linear programming model is used to attain the objective. The LP has 4
modules, as follows:

1. Chain restriction: ensures a proper chain, i.e. it starts with the source
s, ends with the target t, and has k nodes and k-1 edges.

	node-active_1 = 1,  node-active_n = 1
	Σ_i node-active_i = k
	Σ_{i,j} next-node_{i,j} = k - 1
	Σ_i next-node_{i,j} = node-active_j      for j ≠ s
	Σ_j next-node_{i,j} = node-active_i      for i ≠ t
	next-node_{i,j} = 0 for all i ≥ j        (ensuring the chain's chronological order)
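
A small stand-in sketch of the coherence score itself (the influence function
here is a crude tf-idf-overlap proxy, not the paper's random-walk
definition):

  # Coherence of a chain = strength of its weakest transition, where each
  # transition is the sum of per-word influence between consecutive articles.
  import math
  from collections import Counter

  docs = [
      "home prices decline as mortgage defaults rise",
      "mortgage defaults push banks toward bailout",
      "bailout costs fuel the health care debate",
  ]

  tokenized = [d.split() for d in docs]
  df = Counter(w for doc in tokenized for w in set(doc))

  def influence(d1, d2, w):
      if w not in d1 or w not in d2:
          return 0.0
      return math.log(len(docs) / df[w]) + 1.0   # shared rare words matter more

  def coherence(chain):
      return min(sum(influence(chain[i], chain[i + 1], w)
                     for w in set(chain[i]) | set(chain[i + 1]))
                 for i in range(len(chain) - 1))

  print(f"coherence of the 3-article chain: {coherence(tokenized):.2f}")

}}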

@inproceedings {chambers-jurafsky-11_template-script-extraction-from-text,
 title =	 {Template-based information extraction without the templates},
  author =	 {Nathanael Chambers and Dan Jurafsky},
  booktitle =	 {Proceedings of the 49th ACL Human Language Technologies HLT-2011},
  pages =	 {976–986},
  year =	 {2011},
  annote = {

abstract: 
Standard algorithms for template-based information extraction (IE) require
predefined template schemas, and often labeled data, to learn to extract their
slot fillers (e.g., an embassy is the Target of a Bombing template). This
paper describes an approach to template-based IE that removes this
requirement and performs extraction without knowing the template structure in
advance. Our algorithm instead learns the template structure automatically
from raw text, inducing template schemas as sets of linked events (e.g.,
bombings include detonate, set off, and destroy events) associated with
semantic roles.  We also solve the standard IE task, using the induced
syntactic patterns to extract role fillers from specific documents. We evaluate
on the MUC-4 terrorism dataset and show that we induce template structure
very similar to handcreated gold structure, and we extract role fillers with
an F1 score of .40, approaching the performance of algorithms that require
full knowledge of the templates

RELATED: ICML-12 abstract: 
Learning the Central Events and Participants in Unlabeled Text
Nathanael Chambers and Dan Jurafsky

Abstract: The majority of information on the Internet is expressed in written
text. Understanding and extracting this information is crucial to building
intelligent systems that can organize this knowledge. Today, most algorithms
focus on learning atomic facts and relations. For instance, we can reliably
extract facts like 'Annapolis is a City' by observing redundant word patterns
across a corpus. However, these facts do not capture richer knowledge like
the way detonating a bomb is related to destroying a building, or that the
perpetrator who was convicted must have been arrested. A structured model of
these events and entities is needed for a deeper understanding of
language. This talk describes unsupervised approaches to learning such rich
knowledge.

[pdf]

}}



@InProceedings{mnih-tehYW-12icml_neural-probabilistic-language-models,
  author =    {Andriy Mnih and Yee Whye Teh},
 title =     {A fast and simple algorithm for training neural probabilistic language models},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {1751--1758},
  annote = {

--- review 2013 Rohitangsu Das  ~rohitdas/cs365/hw2/paper_cs365.pdf

--Neural Probabilistic Language Models--

NPLM models perform better on many tasks than n-gram methods.  An NPLM models
the distribution of the next word as a smooth function of learned
multidimensional real-valued representations of the context words and the
target words.  One of the main drawbacks is the cost of computing these
real-valued representations, whose time complexity is quadratic in d (the
dimension of the word feature space).

These models form the backbone of many important areas such as speech
recognition, machine translation and information retrieval. The main aim is
to predict a word from the vocabulary given a context (a set of words), in
the hope that the predicted word is meaningful, which is the motivation for
training a good model.

For example, the algorithm was used to complete sentences with large trained
neural language models for the Microsoft Research Sentence Completion
Challenge. To make training tractable, one therefore looks to reduce the
vocabulary size or to use a context weight matrix that reduces the complexity
to d.

One interesting aspect is that the model (equation 1 in the paper) can
describe both neural and maximum entropy models simply by changing the score
function.  Two forms of learning algorithm are considered:

1. Maximum likelihood training, which maximizes the log-likelihood but is
   impractical for large vocabularies; importance sampling can be used to
   speed up training.

2. Noise-contrastive estimation, which reduces the problem to binary
   classification of samples into two sets, data and noise (see the sketch
   below).  A unigram noise distribution is shown to be sufficient for
   training neural language models efficiently, though a context-dependent
   noise distribution could work even better.
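
A sketch of the noise-contrastive objective as binary classification; the
bilinear scorer, sizes and noise distribution below are stand-ins, not the
paper's model:

  # NCE loss for one (context, word) example: classify the observed word
  # against k words drawn from a unigram noise distribution.
  import numpy as np

  rng = np.random.default_rng(4)
  V, d, k = 1000, 32, 5                      # vocab size, dims, noise samples
  C = rng.normal(size=(V, d)) * 0.01         # context vectors (stand-in)
  W = rng.normal(size=(V, d)) * 0.01         # target-word vectors (stand-in)
  unigram = rng.dirichlet(np.ones(V))        # noise distribution

  def score(ctx, w):
      return float(C[ctx] @ W[w])            # unnormalized log-score

  def nce_loss(ctx, w_data):
      noise = rng.choice(V, size=k, p=unigram)
      def p_data(w):                         # P(class = data | ctx, w)
          s = score(ctx, w) - np.log(k * unigram[w])
          return 1.0 / (1.0 + np.exp(-s))
      loss = -np.log(p_data(w_data) + 1e-12)
      loss -= sum(np.log(1.0 - p_data(w) + 1e-12) for w in noise)
      return loss

  print(f"NCE loss for one example: {nce_loss(ctx=3, w_data=42):.3f}")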

Our results show that the learning algorithm is very stable and can produce
models that perform as well as the ones trained using maximum likelihood in
less than one-tenth of the time. In a large-scale test of the approach, we
trained several neural language models on a collection of Project Gutenberg
texts, achieving state-of-the-art performance on the Microsoft Research
Sentence Completion Challenge dataset. [1]

References

[1] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for
training neural probabilistic language models. 2012.

---

Abstract: Neural probabilistic language models (NPLMs) have recently
superseded smoothed n-gram models as the best-performing model class for
language modelling. Unfortunately, the adoption of NPLMs is held back by
their notoriously long training times, which can be measured in weeks even
for moderately-sized datasets. These are a consequence of the models being
explicitly normalized, which leads to having to consider all words in the
vocabulary when computing the log-likelihood gradients. We propose a fast and
simple algorithm for training NPLMs based on noise-contrastive estimation, a
newly introduced procedure for estimating unnormalized continuous
distributions. We investigate the behaviour of the algorithm on the Penn
Treebank corpus and show that it reduces the training times by more than an
order of magnitude without affecting the quality of the resulting models. The
algorithm is also more efficient and much more stable than importance
sampling because it requires far fewer noise samples to perform well. We
demonstrate the scalability of the proposed approach by training several
neural language models on a 47M-word corpus with a 80K-word vocabulary,
obtaining state-of-the-art results in the Microsoft Research Sentence
Completion Challenge. 

discussion+video: http://icml.cc/discuss/2012/855.html


[pdf]

}}

@inproceedings{teh-06_hierarchical-bayesian-language-model,
  title={A hierarchical Bayesian language model based on Pitman-Yor processes},
  author={Teh, Yee Whye},
  booktitle={Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics},
  pages={985--992},
  year={2006},
  annote = 


}}



@inproceedings{shahaf-guestrin-11ijc_connecting-between-news-articles,
 title={Connecting the dots between news articles},
  author={Dafna Shahaf and Carlos Guestrin},
  booktitle={Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining},
  pages={623--632},
  year={2010},
  abstract = {

The process of extracting useful knowledge from large datasets has become one
of the most pressing problems in today’s society. The problem spans entire
sectors, from scientists to intelligence analysts and web users, all of whom
are constantly struggling to keep up with the larger and larger amounts of
content published every day. With this much data, it is often easy to miss
the big picture. In this paper, we investigate methods for automatically
connecting the dots – providing a structured, easy way to navigate within a
new topic and discover hidden connections. We focus on the news domain: given
two news articles, our system automatically finds a coherent chain linking
them together. For example, it can recover the chain of events leading from
the decline of home prices (2007) to the health-care debate (2009). We
formalize the characteristics of a good chain and provide efficient
algorithms to connect two fixed endpoints. We incorporate user feedback into
our framework, allowing the stories to be refined and personalized. Finally,
we evaluate our algorithm over real news data. Our user studies demonstrate
the algorithm's effectiveness in helping users understanding the news.


[pdf]

}}



@InProceedings{pachauri-collins-singh-12icml_matching-using-domain-knowledge,
  author =    {Deepti Pachauri and Maxwell Collins and Vikas Singh and Risi Kondor},
  title =     {Incorporating Domain Knowledge in Matching Problems via Harmonic Analysis},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {1271--1278},
  annote = {

--review 2013 Amit Kumar ~akkumar/cs365/hw2/report.pdf

Matching two sets of objects in machine learning is a tough task. It can be
solved by reducing it to some form of QAP [2]. The quadratic assignment
problem (QAP) is a combinatorial optimization problem [1] and is very hard to
solve: it asks for a matching of two weighted graphs G and G', with adjacency
matrices A and A', that maximizes the overlap between them.

Summary

In this paper the authors give an algorithm for the matching problem, posed
as a QAP, based on learning a modified objective function f from a set of
prior "training" QAP instances. At test time only a sample set X is known,
and a prediction function f : X → Y maps it to a predicted label from the
label space [3]. A key property is that the inverse map of f can be
expressed, which makes f learnable [2]. The objective is expressed as the
graph correlation between the two graphs G and G' [2]. Features such as the
Euclidean distance between interest points or shape context features can be
used to establish relationships between the vertices of G [2]. The learning
problem is transformed using ideas from the theory of non-commutative Fourier
analysis on the symmetric group [2].
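
A sketch of the underlying QAP graph-matching objective on a tiny example
(brute force over permutations; the paper's learned modification of the
objective is not shown):

  # Find the permutation maximizing the overlap tr(A P A' P^T) between two
  # weighted graphs, where A' is a relabelled copy of A.
  from itertools import permutations
  import numpy as np

  rng = np.random.default_rng(5)
  n = 5
  A = rng.random((n, n))
  A = (A + A.T) / 2                                     # weighted graph G
  perm_true = rng.permutation(n)
  A2 = A[np.ix_(perm_true, perm_true)]                  # G' = relabelled copy

  def overlap(perm):
      P = np.eye(n)[list(perm)]
      return float(np.trace(A @ P @ A2 @ P.T))          # QAP objective

  best = max(permutations(range(n)), key=overlap)
  print("recovered:", best)
  print("expected :", tuple(int(i) for i in np.argsort(perm_true)))  # inverse relabelling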

Contributions

The contribution of this paper is the parameter learning framework for a
class of combinatorial problems where the solution is a candidate in the
symmetric group Sn . We show how the representation theory of Sn makes the
procedure computationally tractable, and how Branch and Bound schemes can be
modified to learn information relevant for problem instances coming from an
application of interest.[2]

Uses

In computer vision this can help in finding correspondences between multiple
images of the same scene taken from different viewpoints [2], in matching a
face against a set of sample faces, and in identifying objects.

In machine learning, it is useful for aligning examples before a meaningful
similarity measure can be computed between them.

REFERENCES
1. wikipedia page of quadratic assignment problem:
http://en.wikipedia.org/wiki/Quadratic_assignment_problem
2. pachauri-collins-singh-12icml_matching-using-domain-knowledge
3. wikipedia page of Structured SVM
http://en.wikipedia.org/wiki/Structured_SVM


Abstract: Matching one set of objects to another is a ubiquitous task in
machine learning and computer vision that often reduces to some form of the
quadratic assignment problem (QAP). The QAP is known to be notoriously hard,
both in theory and in practice. Here, we investigate if this difficulty can
be mitigated when some additional piece of information is available: (a) that
all QAP instances of interest come from the same application, and (b) the
correct solution for a set of such QAP instances is given. We propose a new
approach to accelerating the solution of QAPs based on learning parameters
for a modified objective function from prior QAP instances. A key feature of
our approach is that it takes advantage of the algebraic structure of
permutations, in conjunction with special methods for optimizing functions
over the symmetric group S_n in Fourier space. Experiments show that in
practical domains the new method can significantly outperform existing
approaches. 

pachauri,mcollins@cs.wisc.edu risi@uchicago.edu vsingh@biostat.wisc.edu

discussion+video on ICML pages

[pdf]

}}



@inproceedings{caragea-silvescu-mitra-12_hashing-and-abstraction-sparse-high-D-features,
 title={Combining Hashing and Abstraction in Sparse High Dimensional Feature Spaces},
  author={Cornelia Caragea and Adrian Silvescu and Prasenjit Mitra},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012}
  annote = {

A popular approach to information retrieval from documents involves "bag of
words" features; with a large vocabulary this becomes extremely
high-dimensional and computationally intractable.  This work applies hashing
and agglomerative clustering (abstraction) to obtain a smaller set of sparse
features.
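
A sketch of the hashing half of that combination (the abstraction step via
agglomerative clustering is not shown); the bucket count and hash choice are
assumptions:

  # Map an unbounded bag-of-words vocabulary into a fixed number of buckets
  # in constant time per feature (signed feature hashing).
  import hashlib
  from collections import Counter

  N_BUCKETS = 2 ** 12            # assumed hashed feature dimensionality

  def hash_features(tokens, n_buckets=N_BUCKETS):
      vec = Counter()
      for tok in tokens:
          h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
          bucket = h % n_buckets
          sign = 1 if (h >> 16) % 2 == 0 else -1   # signed hashing reduces bias
          vec[bucket] += sign
      return vec                                    # sparse hashed representation

  doc = "the development of effective and efficient classification methods".split()
  print(dict(hash_features(doc)))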

abstract:
With the exponential increase in the number of documents available online,
e.g., news articles, weblogs, scientific documents, the development of
effective and efficient classification methods is needed. The performance of
document classifiers critically depends, among other things, on the choice of
the feature representation. The commonly used “bag of words” and n-gram
representations can result in prohibitively high dimensional input
spaces. Data mining algorithms applied to these input spaces may be
intractable due to the large number of dimensions. Thus, dimensionality
reduction algorithms that can process data into features fast at runtime,
ideally in constant time per feature, are greatly needed in high throughput
applications, where the number of features and data points can be in the
order of millions. One promising line of research to dimensionality reduction
is feature clustering. We propose to combine two types of feature clustering,
namely hashing and abstraction based on hierarchical agglomerative
clustering, in order to take advantage of the strengths of both techniques.
Experimental results on two text data sets show that the combined approach
uses significantly smaller number of features and gives similar performance
when compared with the “bag of words” and n-gram approaches.


[pdf]

}}



@inproceedings{fangJ-mitra-12aaai_table-header-detection-document,
 title={Table Header Detection and Classification},
  author={Jing Fang and Prasenjit Mitra and Zhi Tang and C. Lee Giles},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012}
  annote = {

detect tables in documents

---2013 Review Mohit Gupta ~mohitt/cs365/hw2/review_mohitt.pdf

The goal of the authors is to detect tables in documents. There is a huge
diversity in table styles and structure, and it is assumed that this
structure is largely conveyed by the table headers, i.e. the top rows and the
leftmost columns of a generic table. Hence the problem reduces to table
header detection and classification, which is the focus of the paper.

The paper assumes that data rows would normally share similar data type, cell
alignment, character spacing, font and size etc. different from the
characteristics of the header (or header rows in case of a multi-dimensional
table). It does not focus on other attributes that could differentiate header
and data cells, for instance, background color and ruling lines.

First, a weighted average score analysis is used: font size score, string
length score, overlap score, data type score, alignment score, etc. are
combined by a weighted mean to define the match score of corresponding cells
in two adjacent rows. The similarity between each pair of consecutive rows is
computed, and the first local minimum from the top is marked as the
separation, on the assumption that headers generally appear at the beginning
(a sketch of this heuristic follows below).
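
A sketch of this split heuristic with a single, assumed similarity signal
(shared cell data type); the paper combines several weighted scores:

  # Score similarity of consecutive rows and cut at the first local minimum:
  # the rows above the dip are taken to be the header.
  def cell_type(cell):
      return "num" if cell.replace(".", "", 1).isdigit() else "text"

  def row_similarity(r1, r2):
      matches = sum(cell_type(a) == cell_type(b) for a, b in zip(r1, r2))
      return matches / max(len(r1), len(r2))

  def header_rows(table):
      sims = [row_similarity(table[i], table[i + 1])
              for i in range(len(table) - 1)]
      for i in range(1, len(sims)):
          if sims[i - 1] < sims[i]:          # first local minimum of similarity
              return table[:i]
      return table[:1]                       # fall back to a single header row

  table = [["Country", "Population", "GDP"],
           ["India",  "1400", "3.7"],
           ["Japan",  "125",  "4.2"],
           ["Brazil", "214",  "2.1"]]
  print(header_rows(table))   # [['Country', 'Population', 'GDP']]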

The results are found not to be satisfactorily accurate, so to improve them,
supervised learning algorithms are used to classify table content into header
and data classes.

Row features such as row length, cell size, number of characters, fraction of
numeric, alphabetical and symbolic characters, font size and font type
(bold/italic) are defined. Another set of scores based on comparing
neighboring rows is defined, including average content repetition (using
Levenshtein distance), average alignment and average overlap proportion.
Analogously, features are defined for columns, which are a subset of those
for rows. The paper uses three types of classifiers: SVM-based (using
LibSVM), logistic-regression-based (using R) and a random-forest-based
classifier (using the Weka toolkit) to classify header cells and data cells.

CiteSeerX, a public scientific search engine and digital library, is used as
a source of actual scientific PDF documents, and TableSeerX, an existing
table detection tool, is used to extract tables from them. The paper does not
consider other document media such as images, handwritten documents or web
pages. Tables detected in error in two randomly selected samples of 200
documents each are manually removed, which leaves 135 and 120 tables from
sample 1 and sample 2 respectively; these form the test set. The proportion
of header types (one-dimensional when only a row or a column header exists,
two-dimensional when both exist, and multi-dimensional headers) is determined
for both samples. Another classification is made based on table layout
complexity, simple versus complex, the latter further divided into tables
with multi-line cells, multi-level headers, long and folded tables,
multi-dimensional tables and other irregular layouts. Both samples are found
to have similar proportions of these classes, and it is concluded that the
dataset is stable.

Among the machine learning methods, which are clearly more accurate than the
heuristic method applied first, the random forest outperforms the other two
classifiers. The accuracy of the classifiers is compared using three
measures: precision, recall and F-measure (the harmonic mean of precision and
recall).

The impact of the feature sets (row features and neighboring-row features) is
tested on the random forest classifier. Using both feature sets together
gives the most accurate results (97.4%). The InfoGain attribute evaluator of
the Weka toolkit shows the most effective features to be: number of
characters, fraction of alphabetic characters, font size, average alignment
and consistency of data type.

---


[pdf]

}}



@inproceedings{heZ-chenC-heX-12_document-summarization-based-on-reconstruction,
 title={Document Summarization Based on Data Reconstruction},
  author={Zhanying He and Chun Chen and Jiajun Bu and Can Wang and Lijun Zhang and Deng Cai and Xiaofei He},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012},
  annote= {


Outstanding Paper Award

ROUGE score - measures effectiveness.
interesting multi-technique comparison fig.1

Most summarization programs find the most relevant sentences, replicating
them at a rate of 1:3 or some such desired rate.

the abstract raises the expectation that this paper generates new sentences
by combining others.  but this is not the goal; it still uses full sentences,
based on the DUC corpora from NIST.

---review 2013 G. Chandra Sekhar,  10267 ~gchandra/cs365/hw2/

Introduction:

Document summarization is a procedure used to reduce the text of a document in
order to create a concise summary that retains its most important points. In
the real world, this procedure is most commonly used for snippet generation in
search results and for news headline generation.

Objective of the Paper:

To introduce a new technique for summarizing documents efficiently, giving the
maximum possible information with minimum redundancy by reconstructing the
data.

Two objective functions namely Linear Reconstruction and Nonnegative Linear
Reconstruction have been introduced which help us in establishing the
relationship among sentences.

Methods for Reconstruction :

The linear reconstruction method approximates the document by linear
combinations of the important sentences. It is implemented by a greedy method
that extracts sentences sequentially while optimizing the objective function,
namely minimizing the difference between the candidate vectors and the
reconstruction function.

Here v(i) is the weighted term-frequency vector for sentence i, X is the
selected sentence set (of cardinality strictly less than the number of
candidate vectors), and the a(i)'s are the parameters of the corresponding
reconstruction function f(X).

In the non-negative linear reconstruction method, only additive linear
combinations of the selected sentences are considered; subtractive combinations
are excluded, and the problem is solved with a multiplicative updating
algorithm. The basic idea is to suppress redundant information from the start
through the non-negativity constraints. In the paper, the non-negative linear
reconstruction is formulated as a convex optimization problem, so a global
minimum always exists.

The algorithm (from the paper): after stemming and stop-word elimination, the
document is decomposed into individual sentences and a weighted term-frequency
vector is created for every sentence. All the sentences form the candidate set.

For any sentence in the document, DSDR selects the related sentences from the
candidate set to reconstruct the given sentence by learning a reconstruction
function for the sentence.

For the entire document (or, a set of documents), DSDR aims to find an
optimal set of representative sentences to approximate the entire document
(or, the set of documents), by minimizing the reconstruction error.

By minimizing the sum of reconstruction errors over all the sentences in the
document, DSDR picks the optimal set of representative sentences.  It is an
unsupervised algorithm.
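
A toy sketch of the greedy selection under linear reconstruction, as read from
the description above (Python; numpy least squares stands in for whatever
solver the authors use, and V and k are made-up inputs):

# Greedy linear-reconstruction selection: at each step add the candidate
# sentence that most reduces the total least-squares reconstruction error
# of all sentence vectors. Illustrative reading of DSDR-lin, not the
# authors' exact implementation.
import numpy as np

def greedy_linear_dsdr(V, k):
    selected = []
    for _ in range(k):
        best_err, best_i = None, None
        for i in range(V.shape[0]):
            if i in selected:
                continue
            B = V[selected + [i]]                       # candidate basis
            # least-squares coefficients minimizing ||V - coef.T B||^2
            coef, *_ = np.linalg.lstsq(B.T, V.T, rcond=None)
            err = np.linalg.norm(V - coef.T @ B) ** 2
            if best_err is None or err < best_err:
                best_err, best_i = err, i
        selected.append(best_i)
    return selected

V = np.random.rand(6, 20)          # 6 toy sentence term-frequency vectors
print(greedy_linear_dsdr(V, 3))    # indices of 3 representative sentences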

Experimental Results:

[Figures from the original paper: average F-measure performance on DUC 2006
and on DUC 2007.]

DSDR is compared with several state-of-the-art summarization approaches using
the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measure. This is
a recall-based measure that determines how well a system-generated summary
covers the content present in one or more human-generated model summaries,
known as references. It is recall-based to encourage systems to include all
the important topics of the text.

For comparing the different approaches, the F-measure is used. The F-measures
of four ROUGE metrics are reported: ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-L, on
the DUC 2006 and DUC 2007 data sets.

Conclusion and Contributions

A greedy strategy is used to solve the linear reconstruction problem, while the
nonnegative reconstruction problem is solved by multiplicative updating. The
experimental results show that DSDR (with both reconstruction types)
outperforms existing state-of-the-art summarization approaches. Of the two,
linear reconstruction is the more efficient, while nonnegative reconstruction
performs better by generating less redundant sentences. This is because DSDR
extracts only those sentences that span the intrinsic subspace of the candidate
sentence space; such sentences are significant for reconstructing the input
document and therefore enrich the quality of the summary. Under the DSDR
framework, the sequential method of linear reconstruction is sub-optimal, so
nonnegative DSDR outperforms linear DSDR.

Acknowledgements 
http://en.wikipedia.org 
https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/view/4991

---



ABSTRACT

Document summarization is of great value to many real world applications,
such as snippets generation for search results and news headlines
generation. Traditionally, document summarization is implemented by
extracting sentences that cover the main topics of a document with a minimum
redundancy. In this paper, we take a different perspective from data
reconstruction and propose a novel framework named Document Summarization
based on Data Reconstruction (DSDR). Specifically, our approach generates a
summary which consists of those sentences that can best reconstruct the
original document. To model the relationship among sentences, we introduce
two objective functions: (1) linear reconstruction, which approximates the
document by linear combinations of the selected sentences; (2) nonnegative
linear reconstruction, which allows only additive, not subtractive, linear
combinations. In this framework, the reconstruction error becomes a natural
criterion for measuring the quality of the summary. For each objective
function, we develop an efficient algorithm to solve the corresponding
optimization problem. Extensive experiments on summarization benchmark data
sets DUC 2006 and DUC 2007 demonstrate the effectiveness of our proposed
approach.

---

search engines can provide users with snippets as the previews of
the document contents (Turpin et al. 2007; Huang, Liu, and
Chen 2008; Cai et al. 2004; He et al. 2007). News sites usually describe hot
news topics in concise headlines to facilitate browsing. Both the snippets
and headlines are specific
forms of document summary in practical applications.

--Document Summarization based on Data Reconstruction (DSDR)--

minimizes the reconstruction error
for summarization. The algorithm procedure of DSDR is as
follows:

 - After stemming and stop-word elimination, we decompose the document into
	individual sentences and create a weighted term-frequency vector for
	every sentence. All the sentences form the candidate set.
 -  For any sentence in the document, DSDR selects the related sentences from
	the candidate set to reconstruct the given sentence by learning a
	reconstruction function for the sentence.
 - For the entire document (or, a set of documents), DSDR aims to find an
	optimal set of representative sentences to approximate the entire
	document (or, the set of documents), by minimizing the reconstruction
	error.


[pdf]

}}



@inproceedings{zhangR-gao-12_generating-coherent-summaries-w-text-aspects,
 title={Generating Coherent Summaries with Textual Aspects},
  author={Renxian Zhang and Wenjie Li and Dehong Gao},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012},
  annote = {

beyond words: uses
meta-phrase features

Queries are usually handled to extract relevant sentences by Information
Retrieval (IR) techniques such as query expansion (Li et al. 2006; Vanderwende
et al. 2007), built into statistical (Daumé and Marcu 2006), graph-based (Wan
et al. 2007), and learning-based (Fuentes et al. 2007; Schilder and Kondadadi
2008) models. The state of the art from this camp, however, does not fit the
nature of aspects as particular kinds of information that a summary should
include.


[pdf]

++++
}}



@article{chuang-11_text-patterns-semantic-class-learning,
 title={Text Patterns and Compression Models for Semantic Class Learning},
  author={Chuang, C.Y. and Lee, Y.H. and Hsu, W.L.}
  annote = {

Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1376–1381,
Chiang Mai, Thailand, November 8 – 13, 2011.

based on the web, try to cluster names such as Al Pacino and Tom Hanks into
one cluster (say, actors), and so on.  based on a graph-ranking algorithm.

nine semantic classes evaluated; precision@100 for some of them:

  actors/actresses   0.97
  kitchen item       0.72
  philosopher        0.86
  shape              0.69
adding noise (upto 500 documents) has little effect on accuracy. 

---review 2013 Subham Modi

Objective and Purpose:

* Instances of semantic classes are extracted by constructing wrappers based
  on specified seed instances.
* Each extraction is checked with the help of a compression model.
* The results [1] show that the system performs quite consistently even when
  operating on noisy text with many possibly irrelevant documents.

--Introduction--

Many methods have been introduced to extract instances of semantic classes.
Some are motivated by applications in the search engines of major companies
(Pasca, 2007; Chaudhuri et al., 2009), while others address ontology learning
(Cimiano et al., 2004), co-reference resolution (McCarthy and Lehnert, 1995)
and advertisement matching (Chang et al., 2009).

All of these approaches basically fall into distributional and pattern-based
extraction.

--PPM method--

The authors use a text compression model called Prediction by Partial Matching
(PPM) (Cleary and Witten, 1984) to evaluate the contextual similarity between
a mention and the seed instances. The similarity is measured by the compression
ratio achieved when compressing the surrounding context of the mention using a
PPM model loaded with context statistics of the seed instances. [1]

For every input token, the model generates a prediction in the form of a
probability distribution, which is then passed on to an arithmetic coder.

Previous tokens help predict the next token, and the conditional probability of
each token is updated as soon as the next token is read; this new preceding
token, along with the previous ones, then predicts the following token.

--Algorithm and Results--

A set of seed instances and a corpus are given as input; a ranked list of
extracted instances is returned as output.

For each seed instance, a wrapper is created and added to a set W. The
surrounding text of the seeds is collected into another set T, and the PPM
model described above is built from T.

Then, for each wrapper w in W and each mention in the corpus captured by w: if
the compressibility ratio of the mention's context exceeds a threshold θ
(θ = 0.3 in this paper), the mention is skipped.

A wrapper-mention graph G is then constructed, and the vertices of G are
finally ranked by a graph-ranking algorithm.
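
An illustrative stand-in (Python) for the compression-based check: the paper
uses a PPM model loaded with the seed contexts, while this sketch uses zlib
with a preset dictionary built from the seed contexts to convey the same idea,
namely that text resembling the seed contexts compresses better. The seed
strings, the mention strings and the exact ratio definition below are
assumptions, not the paper's:

# Crude stand-in for the PPM-based compressibility check: a zlib preset
# dictionary built from the seed contexts plays the role of the PPM model
# loaded with seed-context statistics.
import zlib

def compression_ratio(text, seed_contexts):
    data = text.encode("utf-8")
    zdict = " ".join(seed_contexts).encode("utf-8")[-32768:]   # zlib dict limit
    comp = zlib.compressobj(9, zlib.DEFLATED, 15, 9, zlib.Z_DEFAULT_STRATEGY, zdict)
    compressed = comp.compress(data) + comp.flush()
    return len(compressed) / max(len(data), 1)

seeds = ["the actor Al Pacino starred in", "actor Tom Hanks won the award for"]
good = compression_ratio("the actor Robert De Niro starred in", seeds)
bad  = compression_ratio("order a large pepperoni pizza online today", seeds)
print(good, bad)   # context resembling the seeds typically compresses better
# The paper compares a compressibility ratio against a threshold (theta = 0.3)
# to decide whether a mention is kept or skipped.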

A bagging approach is adopted, averaging over 30 runs of each algorithm.

In each run, three seeds are selected randomly from a set of 10 prepared
instances.

The resulting 30 lexicons are then merged by a weighted voting function.

The algorithm is tried on an example discussed in the original paper [1],
consisting of 50 web pages retrieved from Google by submitting a query
containing three seed instances.

Results:

The proposed approach outperforms the baseline system and maintains its
precision as the evaluation includes more instances.

The baseline system suffers a performance drop, whereas the proposed approach
shows consistent performance regardless of the presence of noise.

The mechanism proposed by the authors therefore improves overall performance
and prevents spurious extractions.

References:

* [1] Chung-Yao Chuang, Yi-Hsun Lee, Wen-Lian Hsu. Text Patterns and
  Compression Models for Semantic Class Learning. (the reviewed paper)
* Surajit Chaudhuri, Venkatesh Ganti, and Dong Xin. 2009. Exploiting web
  search to generate synonyms for entities. In Proceedings of the 18th
  International Conference on World Wide Web, WWW '09, pages 151–160, New
  York, NY, USA. ACM.
* Philipp Cimiano, Aleksander Pivk, Lars S. Thieme, and Steffen
  Staab. 2004. Learning taxonomic relations from heterogeneous sources of
  evidence. In Proceedings of the ECAI 2004 Ontology Learning and Population
  Workshop.
* Marius Paşca. 2007. Weakly-supervised discovery of named entities using
  web search queries. In Proceedings of the 16th ACM Conference on
  Information and Knowledge Management, CIKM '07, pages 683–690, New York,
  NY, USA. ACM.
* William Chang, Patrick Pantel, Ana-Maria Popescu, and Evgeniy
  Gabrilovich. 2009. Towards intent-driven bidterm suggestion. In
  Proceedings of the 18th International Conference on World Wide Web, WWW
  '09, pages 1093–1094, New York, NY, USA.

---

Abstract
This paper proposes a weakly-supervised approach for extracting instances of
semantic classes. This method constructs simple wrappers automatically based
upon specified seed instances and uses a compression model to assess the
contextual evidence of its extraction. By adopting this compression model,
our approach can better avoid erroneous extractions in a noisy corpus such as
the Web. The empirical results show that our system performs quite
consistently even when operating on a noisy text with a lot of possibly
irrelevant documents.



[pdf]

}}



@inproceedings{nori-bollegala-12_multinomial-relations-social-data,
  title={Multinomial Relation Prediction in Social Data: A Dimension Reduction Approach},
  author={Nozomi Nori and Danushka Bollegala and Hisashi Kashima},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012}
  pages= {115-121},
  annote = {

####

--- review 2013 Pulkit Jain (10543) ~pulkitj/cs365/hw2/HW2.pdf

--Problem Addressed--

The recent explosion of the internet has led to the emergence of social data.
Web services that produce or operate on this data involve social actions, such
as tagging and annotating, which involve multiple, heterogeneous objects. User
actions are modeled as relations. Predicting these actions is important for
many fundamental services such as search engines and recommendation systems.
The work addresses this action prediction problem with a dimensionality
reduction approach.

--Proposed Solution--

The proposed solution is based on reducing each relation instance to multiple
binomial instances and on using attribute information of the involved objects.

 * Nonlinear dimensionality reduction is applied to the binomial relations,
    which embeds objects and relations into a common subspace where each
    object is close to the relations it participates in.

 * Minimizing the distances (using a Euclidean distance metric) between the
    embeddings of objects and relations leads to an optimization problem that
    is formulated as a generalized eigenvalue problem. This gives exact
    solutions that are globally optimal and robust to sparse data.

 * Incorporating attribute information only requires replacing the
    low-dimensional embedding used previously by linear projections of the
    attribute information into the required subspace. This addresses the
    “cold start” problem, where the sparsity of the data is high.

Thus the input to the process/algorithm followed includes sets of objects
that can participate in relations, and a set of observed relation
instances. The output obtained is a list of other possible relations amongst
the objects sorted in order of likelihood of the relation.

Input: S1, S2…SK - Set of Objects
	O (⊂ S1 X S2 X…..X SK) - Set of Observed Relation Instances
	Φ1, Φ2... ΦK - K matrices representing object attributes

Output: S1 X S2 X…..X SK \ O - Possible relations sorted in order of likelihood
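
A loose sketch of the embedding idea (Python with numpy/scipy): each observed
relation instance is linked to the objects it involves, a
Laplacian-eigenmaps-style generalized eigenvalue problem embeds everything in a
common space, and candidate tuples whose objects land close together score
higher. This is only a rough illustration of the general approach; the toy
objects, relations, graph construction and scoring rule are all my own
simplifications, not the authors' formulation (which also has an
attribute-projection variant):

# Each observed relation instance becomes a node linked to its objects
# (its binomial relations); a generalized eigenvalue problem L y = lambda D y
# gives a low-dimensional embedding; candidates are scored by how tightly
# their objects cluster in that space.
import numpy as np
from scipy.linalg import eigh

objects   = ["user1", "user2", "doc1", "doc2", "tag1"]
relations = [("user1", "doc1", "tag1"),
             ("user2", "doc1", "tag1"),
             ("user1", "doc2", "tag1")]            # observed instances

nodes = objects + [f"rel{i}" for i in range(len(relations))]
idx = {n: i for i, n in enumerate(nodes)}
n = len(nodes)

W = np.zeros((n, n))
for i, rel in enumerate(relations):
    r = idx[f"rel{i}"]
    for obj in rel:                                # object <-> relation links
        W[r, idx[obj]] = W[idx[obj], r] = 1.0

D = np.diag(W.sum(axis=1) + 1e-9)                  # (regularized) degree matrix
L = D - W                                          # graph Laplacian
vals, vecs = eigh(L, D)                            # generalized eigenproblem
Y = vecs[:, 1:3]                                   # skip trivial vector, 2-D embedding

def score(candidate):
    pts = Y[[idx[o] for o in candidate]]
    return -np.sum(np.linalg.norm(pts - pts.mean(axis=0), axis=1))  # higher = tighter

print(score(("user2", "doc2", "tag1")), score(("user2", "doc2", "user1")))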

--Results/Contributions--

 * Datasets from Twitter for prediction of “favorite” and “retweet” actions
    and del.icio.us (a social tagging service) were used for measuring the
    performance.

 * The data was also tested against the existing PARAFAC and TUCKER tensor
    decomposition methods along with the proposed solutions to build
    comparisons.

 * The proposed solution showed highest robustness when the data was sparse
    in terms of the relation instances.

 * The solution did not perform well when tested against data that was sparse
    in relations that involved particular objects, but the performance was
    high when attribute information of objects was used along with the
    relation information.

 * Overall, the proposed solution is highly robust to data sparsity in
    comparison to the existing methods of prediction using tensor
    decompositions.

REFERENCES
[1] Nozomi Nori, Danushka Bollegala, and Hisashi Kashima. 2012. Multinomial
    Relation Prediction in Social Data: A Dimension Reduction Approach. In
    Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence.
[2] Lin, Y.-R.; Sun, J.; Castro, P.; Konuru, R.; Sundaram, H.; and Kelliher, A.
    2009. MetaFac: community discovery via relational hypergraph factorization.
    In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining.
[3] CP Decomposition, Wikipedia. http://en.wikipedia.org/wiki/CP_decomposition
[4] Tucker Decomposition, Wikipedia.
    http://en.wikipedia.org/wiki/Tucker_decomposition

---

abstract: 
The recent popularization of social web services has made them one of the
primary uses of the World Wide Web. An important concept in social web
services is social actions such as making connections and communicating with
others and adding annotations to web resources.  Predicting social actions
would improve many fundamental web applications, such as recommendations and
web searches.

One remarkable characteristic of social actions is that they involve multiple
and heterogeneous objects such as users, documents, keywords, and
locations. However, the high-dimensional property of such multinomial
relations poses one fundamental challenge, that is, predicting multinomial
relations with only a limited amount of data.

In this paper, we propose a new multinomial relation prediction method, which
is robust to data sparsity. We transform each instance of a multinomial
relation into a set of binomial relations between the objects and the
multinomial relation of the involved objects. We then apply an extension of a
low-dimensional embedding technique to these binomial relations, which
results in a generalized eigenvalue problem guaranteeing global optimal
solutions. We also incorporate attribute information as side information to
address the “cold start” problem in multinomial relation prediction.

Experiments with various real-world social web service datasets demonstrate
that the proposed method is more robust against data sparseness as compared
to several existing methods, which can only find sub-optimal solutions.

}}



@inproceedings{heZ-chenC-heX-12_document-summarization-based-on-reconstruction,
  title={Document Summarization Based on Data Reconstruction},
  author={Zhanying He and Chun Chen and Jiajun Bu and Can Wang and Lijun Zhang and Deng Cai and Xiaofei He},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012},
  annote= {

Outstanding Paper Award

ROUGE score - measures effectiveness. 
interesting multi-technique comparison fig.1

---

search engines can provide users with snippets as the previews of
the document contents (Turpin et al. 2007; Huang, Liu, and
Chen 2008; Cai et al. 2004; He et al. 2007). News sites usually describe hot news topics in concise headlines to facilitate browsing. Both the snippets and headlines are specific
forms of document summary in practical applications.

Most of the existing generic summarization approaches use a ranking model to
select sentences from a candidate set (Brin and Page 1998; Kleinberg 1999;
Wan and Yang 2007).  These methods suffer from a severe problem that top
ranked sentences usually share much redundant information. Although there are
some methods (Conroy and O’leary 2001; Park et al. 2007; Shen et al. 2007)
trying to reduce the redundancy, selecting sentences which have both good
coverage and minimum redundancy is a non-trivial task.

Previous studies have shown that there is psychological and physiological evidence for parts-based representation in the human brain (Palmer 1977; Wachsmuth, Oram,
and Perrett 1994; Cai et al. 2011). Naturally, a document
summary should consist of the parts of sentences. With the
nonnegative constraints, our method leads to parts-based reconstruction so that no redundant information needs to be
subtracted from the combination.

Extensive experiments on summarization benchmark data sets DUC 2006 and DUC
2007 demonstrate the effectiveness of our proposed approach.


[pdf]

}}



@inproceedings{jans-bethard-12_skip-grams-for-predicting-scripts,
  title={Skip n-grams and ranking functions for predicting script events},
  author={Jans, B. and Bethard, S. and Vulic, I. and Moens, M.F.},
  booktitle={Proceedings of the thirteenth conference of the European chapter of the association for computational linguistics (EACL 2012)},
  year={2012}
  annote = {

---2013 review chirag gupta ~gchirag/cs365/hw2/assignment2_10212.pdf

Improving the following:

 *  Identification of events from narrative script
 *  Gathering statistics from the event chains
 *  Choose ranking functions for predicting script events

Corpus Used: Reuters Corpus, Volume 1
	     Andrew Lang Fairy Tale Corpus

--Procedure--

Model takes as input a partial script and produces as output a ranked list of
events for that script.

I. Non-narrative articles were filtered out of the corpora.
II. Identifying event chains from the script

As in the previous works of Chambers and Jurafsky in 2008 and 2009, and 
in Mcintyre and Lapata in 2009 and 2010, the following are used:

   Stanford Parser for identifying the dependency structure.
   OpenNLP coreference engine for identifying entities in each article.

Finally, event chains were constructed for each entity by collecting all noun
phrases associated with that entity and retrieving any subject/object
dependencies with a verb.

Either all chains, only the long chains, or only the longest chains can be used for training a model.

III. Gathering event chain statistics

Regular bigrams, 1-skip bigrams and 2-skip bigrams are three strategies for
collecting bigrams. Event-pair counts and various conditional probabilities are
then computed over these bigrams to gather the statistics.
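
A small sketch of the skip-bigram counting and a smoothed bigram-probability
ranking (Python; the event names and chains are invented, and the add-one
smoothing is my choice, not necessarily the paper's):

# Collect k-skip bigrams from event chains and rank candidate next events
# with a simple smoothed bigram probability.
from collections import Counter

def skip_bigrams(chain, k):
    """All ordered pairs (e_i, e_j) with at most k events skipped between them."""
    return [(chain[i], chain[j])
            for i in range(len(chain))
            for j in range(i + 1, min(i + 2 + k, len(chain)))]

chains = [["arrest", "charge", "convict", "sentence"],
          ["arrest", "charge", "acquit"]]

k = 2
pair_counts  = Counter(p for c in chains for p in skip_bigrams(c, k))
first_counts = Counter(p[0] for c in chains for p in skip_bigrams(c, k))
vocab = {e for c in chains for e in c}

def bigram_prob(e1, e2):
    # P(e2 | e1) with add-one smoothing over the observed event vocabulary
    return (pair_counts[(e1, e2)] + 1) / (first_counts[e1] + len(vocab))

# Rank candidate next events for a partial script ending in "charge"
candidates = ["convict", "acquit", "sentence"]
print(sorted(candidates, key=lambda e: -bigram_prob("charge", e)))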

IV. Ranking methods used

Pointwise Mutual Information (PMI) used by Chambers and Jurafsky
Ordered PMI

Bigram probabilities of language modelling

V. Evaluation Metrics

Given an event chain of n events, n cloze tests are run, each time removing 
one of the n events to form a partial script.

Given a partial script as input, an accurate event predictor is one that ranks
the missing event highly in its output guess list.

Average rank and Recall@N are two such metrics; better results correspond to a
lower average rank and higher Recall@N values. N = 50 is used in this work.

--Results:--

On Reuters corpus: all chains, 2-skip bigrams and bigram probabilities gave the 
best results.
On Fairy Tale corpus: long chains performed the best.
1-skip and 2-skip bigrams performed equally well.
Mixed results for bigram probabilities and PMI.

Main contributions:

* Used skip-grams for the first time in this setting and showed that they are
	better than the n-gram approach used previously.
*  Proposed the bigram-probability method for ranking events in a given partial
	script.
*  A new evaluation procedure, Recall@N, is also introduced.

References:
[1] Bram Jans, Steven Bethard, Ivan Vulić, and Marie-Francine Moens. Skip
	N-grams and  Ranking Functions for Predicting Script Events. In
	Proceedings of the 13th Conference of  the European Chapter of the
	Association for Computational Linguistics, 2012. 
[2] http://en.wikipedia.org/wiki/Dependency_grammar Dependency Parsing
[3] http://en.wikipedia.org/wiki/Coreference Coreference resolution

---

In this paper, we extend current state-of-the-art research on unsupervised
acquisition of scripts, that is, stereotypical and frequently observed
sequences of events. We design, evaluate and compare different methods for
constructing models for script event prediction: given a partial chain of
events in a script, predict other events that are likely to belong to the
script. Our work aims to answer key questions about how best to (1) identify
representative event chains from a source text, (2) gather statistics from
the event chains, and (3) choose ranking functions for predicting new script
events.  We make several contributions, introducing skip-grams for collecting
event statistics, designing improved methods for ranking event predictions,
defining a more reliable evaluation metric for measuring predictiveness, and
providing a systematic analysis of the various event prediction models.

[pdf]

see also [chambers-jurafsky-11_template-script-extraction-from-text.pdf]

}}



####
@InProceedings{prasse-sawade-12icml_identifying-regular-expressions-in-email-spam,
author =    {Paul Prasse and Christoph Sawade and Niels Landwehr and Tobias Scheffer},
title =     {Learning to Identify Regular Expressions that Describe Email Campaigns},
booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
series =    {ICML '12},
year =      {2012},
editor =    {John Langford and Joelle Pineau},
location =  {Edinburgh, Scotland, GB},
isbn =      {978-1-4503-1285-1},
month =     {July},
publisher = {Omnipress},
address =   {New York, NY, USA},
pages=      {441--448},

---review 2013  Gurpreet Singh Khanuja(10277) ~gurpreet/cs365/hw2/hw2_10277.pdf

This paper addresses the problem of generating, from a given set of strings, a
regular expression that matches what a person would have written to correctly
identify the given messages. The motivation is to automate the work of an email
service postmaster who writes regular expressions to blacklist email campaigns,
since writing such regular expressions by hand is a very time consuming
process.

So here is the problem setting model:

First, we search for a predictive model which maps a set of emails belonging to
some campaign x to a regular expression ŷ. The output of this predictive model
should be as close as possible to the regular expression that a human
postmaster would have written to identify the given email campaign.

fw : x → ŷ

To learn such a predictive model, the training data consists of campaigns x and
the regular expressions y created by the postmaster. The model should have
minimal risk R, i.e. it should minimize the expected loss:

R[fw] = ∫∫ Δ(y, fw(x), x) p(x, y) dx dy

The loss function Δ(y, fw(x), x) measures the syntactic and semantic
deviation between the two regular expressions, fw(x) and the one given by the
postmaster for the campaign x. This can be done by comparing the parse trees
of the two regular expressions.
 
[eqn] 
 
To find the optimal regular expression, we compute

fw(x) = arg max_{y ∈ Y} w^T Ψ(x, y)

and then approximate this maximum by the maximum over a constrained, finite
search space, to alleviate the intractability caused by the infinite space of
all regular expressions.

This constrained search space initially contains an alignment of all strings in
x, defined as follows [1].

DEFINITION 1 (Alignment). The set of alignments Ax of a batch of strings x
contains all concatenations in which strings from Σ+ and the wildcard symbol
“(.*)” alternate, and that generate all elements of x.
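
A minimal sketch of constructing one alignment in the sense of Definition 1
(Python, using difflib to find substrings shared by all strings in the batch
and joining them with "(.*)" wildcards). This is only an illustrative
construction, not the decoder or the search space used in the paper:

# Build a template regex: literal substrings common to every string in the
# batch alternate with "(.*)" wildcards, so the result matches all of them.
import re
from difflib import SequenceMatcher

def align_pair(a, b):
    """Common blocks of a and b, in order (list of literal strings)."""
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    return [a[m.a:m.a + m.size] for m in blocks if m.size > 0]

def alignment(batch):
    common = batch[0]
    for s in batch[1:]:
        # keep only the text shared with the next string; \x00 marks gaps
        common = "\x00".join(align_pair(common, s))
    parts = [p for p in common.split("\x00") if p]
    return "(.*)" + "(.*)".join(re.escape(p) for p in parts) + "(.*)"

batch = ["Buy cheap watches now at shop123.example.com",
         "Buy cheap watches today at shop777.example.com"]
rx = alignment(batch)
print(rx)
assert all(re.fullmatch(rx, s) for s in batch)   # generates all elements of x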

After further derivation, the maximization over all y can be decomposed into n
maximization problems, which can be solved in O(n · |Y_D|).

The model is then trained by minimizing the regularized empirical risk, with
the l2 regularizer Ω(w) = (1/2C)||w||². The authors refer to this learning
procedure as Rex-SVM.
References

[1] Prasse, Sawade, Landwehr and Scheffer. “Learning to Identify Regular
Expressions that Describe Email Campaigns”.

Abstract: This paper addresses the problem of inferring a regular expression
from a given set of strings that resembles, as closely as possible, the
regular expression that a human expert would have written to identify the
language. This is motivated by our goal of automating the task of postmasters
of an email service who use regular expressions to describe and blacklist
email spam campaigns. Training data contains batches of messages and
corresponding regular expressions that an expert postmaster feels confident
to blacklist. We model this task as a learning problem with structured output
spaces and an appropriate loss function, derive a decoder and the resulting
optimization problem, and report on a case study conducted with an email
service.

discussion+video on ICML pages


[pdf]

}}

ROBOTICS


@inproceedings{hart-scassellati-12aaai-mirror-perspective-w-humanoid-robot,
  title={Mirror Perspective-Taking with a Humanoid Robot},
  author={Hart, Justin W and Scassellati, Brian},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012}
  annote = {

The robot goes up to a mirror and waves its hand - and is able to tell whether
it is itself or not.

---review by Valcarcel Xavier ~xavier/cs365/hw2.html

The “mirror test” was invented in 1970 by Gallup to measure the self-awareness
of infants or animals. It consists of putting the individual in front of a
mirror; if the individual recognizes the reflection as itself by inspecting its
own body through the mirror, the test is passed. Only very intelligent animals,
such as chimpanzees, dolphins and elephants, or humans older than 18 months,
are able to pass this test [1]. An alternative test is to see whether the
individual can use the mirror as an instrument; a few more species are able to
pass this version, but they are still rare.

The goal of this paper is to develop the perspective-taking model, in order to
eventually create a system able to pass the “mirror test”.

If a robot can pass this test using only its own observations, it would be a
big step in how robots represent themselves in their environment and use
mirrors for spatial reasoning.

--Architecture--

6 modules: 
 - End-Effector Model, 
 - Perceptual Model, 
 - Perspective-Taking Model, 
 - Structural Model, 
 - Appearance Model, and 
 - Functional Model 

This paper develops the Perspective-Taking Model using working implementations
of two of the other models (the End-Effector Model and the Perceptual Model).
The goal of the Perspective-Taking Model is to construct a 3D representation of
the objects seen in the mirror's reflection, at their real locations. In the
experiment reported in the paper, the robot observes its end-effector through
the mirror and then, using its self-knowledge, computes the position of the
end-effector.

To achieve this, the first step is to determine the position and orientation of
the mirror plane. The robot moves its end-effector through several poses in
front of the mirror without seeing it directly, and records data using the
End-Effector Model and the Perceptual Model. The End-Effector Model computes
the position of the end-effector from the known positions and angles of the
joints of the arm. The Perceptual Model computes the position of the
end-effector by looking at it (using stereo from 2 cameras). When the robot
looks at the reflection of the end-effector, the Perceptual Model computes the
position of a virtual end-effector behind the mirror. So, given a sample of
known end-effector positions and the corresponding virtual positions behind the
mirror, the mirror plane can be recovered: the virtual position in the
reflection and the real position of the end-effector are symmetric about the
mirror plane.

Once the position and orientation of the mirror plane relative to the two
cameras are known, the real position of the end-effector can be computed just
by looking at the reflection and applying the symmetry.
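
A small geometric sketch of this idea (Python/numpy): fit the mirror plane from
pairs of real end-effector positions (kinematic model) and the corresponding
virtual positions seen behind the mirror (stereo vision), then reflect a new
virtual observation across that plane. The simple averaging-based plane fit and
the toy numbers are my own choices, not the authors' estimator:

# Mirror plane from real/virtual point pairs, then reflection across it.
import numpy as np

def fit_mirror_plane(real_pts, virtual_pts):
    """Return unit normal n and offset d of the plane n.x = d."""
    mids = (real_pts + virtual_pts) / 2.0        # midpoints lie on the mirror
    normals = real_pts - virtual_pts             # each pair is symmetric about it
    n = normals.mean(axis=0)
    n /= np.linalg.norm(n)
    d = float(mids.mean(axis=0) @ n)
    return n, d

def reflect(p, n, d):
    """Reflect point p across the plane n.x = d."""
    return p - 2.0 * (p @ n - d) * n

# toy data: the true mirror is the plane x = 1
real = np.array([[0.2, 0.1, 0.3], [0.5, -0.2, 0.4], [0.7, 0.3, 0.1]])
virt = np.array([[1.8, 0.1, 0.3], [1.5, -0.2, 0.4], [1.3, 0.3, 0.1]])
n, d = fit_mirror_plane(real, virt)
print(n, d)                                      # plane x = 1 (up to sign)
print(reflect(np.array([1.9, 0.0, 0.0]), n, d))  # recovers the real point [0.1, 0, 0]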

In the experiment, 50 poses of the arm were used for training and 100 for
testing. The results are quite good: even with only 10 arm poses, the distance
between the position computed with the Perspective-Taking Model and the
position computed with the End-Effector Model is less than 5 cm. These results
are very encouraging steps toward a robot able to pass the “mirror test”.

[1] 1st page of the paper “Mirror Perspective-Taking with a Humanoid Robot”
    by Justin W. Hart and Brian Scassellati 


---

Abstract
The ability to use a mirror as an instrument for spatial reasoning enables an
agent to make meaningful inferences about the positions of objects in space
based on the appearance of their reflections in mirrors. The model presented
in this paper enables a robot to infer the perspective from which objects
reflected in a mirror appear to be observed, allowing the robot to use this
perspective as a virtual camera. Prior work by our group presented an
architecture through which a robot learns the spatial relationship between
its body and visual sense, mimicking an early form of self-knowledge in which
infants learn about their bodies and senses through their interactions with
each other. In this work, this self-knowledge is utilized in order to
determine the mirror’s perspective. Witnessing the position of its
end-effector in a mirror in several distinct poses, the robot determines a
perspective that is consistent with these observations. The system is
evaluated by measuring how well the robot’s predictions of its end-effector’s
position in 3D, relative to the robot’s egocentric coordinate system, and in
2D, as projected onto its cameras, match measurements of a marker tracked by
its stereo vision system.  Reconstructions of the 3D position of the
end-effector, as computed from the perspective of the mirror, are found to
agree with the forward kinematic model within a mean of 31.55 mm.  When
observed directly by the robot’s cameras, reconstructions agree within
5.12 mm. Predictions of the 2D position of the end-effector in the visual field
agree with visual measurements within a mean of 18.47 pixels, when observed in
the mirror, or 5.66 pixels, when observed directly by the robot’s cameras.

}}

  

@inproceedings{kalakrishnan-righetti-11iros_learning-force-control-policies-compliant-manipulation,
 title =	 {Learning force control policies for compliant manipulation },
author =	 {Mrinal Kalakrishnan and Ludovic Righetti and Peter Pastor
and Stefan Schaal}, 
booktitle =	 {Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ },
pages =	 {4639-4644},
year =	 {2011},
annote = {

uses reinforcement learning to learn 
a task like opening a door (very hard for robots). 
the force control regime is learned with a stochastic control under
reward. 

control policy is represented as a discretized trajectory of desired
end-effector positions, orientations, forces and torques. Positions and
orientations are initialized from a kinesthetic demonstration of the
task.  The required forces and torques cannot be observed during this process,
and are initialized to zero. The policies are then optimized using the Policy
Improvement with Path Integrals (PI2) reinforcement learning algorithm
(Theodorou et al., 2010). This allows acquisition of a suitable force/torque
control policy through trial and error. Policy performance is measured by a
cost function that measures task success and penalizes squared accelerations
of the trajectory. The latter part of the cost function is designed to be
quadratic in the trajectory parameters.

REWARD (pen task) some reward is given whenever the finger angles do not
cross a threshold (pen has not slipped).  Being able to hold on to it longer
gives a better reward, which provides a gradient for the learning algorithm.

--- 2013 Nikhil Aggarwal review ~nikhil/cs365/hw2/LEARNING%20FORCE%20CONTROL%20POLICIES%20FOR%20COMPLIANT%20MANIPULATION.pdf

1) INTRODUCTION: In order to build truly assistive robots, they must be capable
of fine manipulation skills. Complex contact forces with the environment
require a controller that can handle forces in a meaningful way. Rigid body
dynamics can give the robot motion parameters, but once the robot is in contact
with the environment, the resulting contact interactions must also be handled.
Reinforcement learning has been used to learn manipulation skills, but those
approaches consider only the positions to be controlled, which can be
potentially dangerous since the interaction forces are not explicitly
considered. This paper presents a new approach in which the forces and torques
to be controlled at the end-effector are learned, combined with a kinematic
demonstration. An initial kinematic demonstration is given, but it contains no
information about forces and torques, so these are acquired by reinforcement
learning through trial and error. “The contributions of this paper are
two-fold:

(1) we demonstrate that learning force control policies enables compliant
    execution of  manipulation tasks with increased robustness as opposed to
    stiff position control, and  
(2) we introduce a policy parameterization that uses finely discretized
    trajectories coupled with a  cost function that ensures smoothness during
    exploration and learning.”

2) POLICY IMPROVEMENT WITH PATH INTEGRALS (PI^2):

Policy Improvement with Path Integrals optimizes the control parameters with
respect to a cost function: given a parameterized policy and a state-dependent
cost function, we need to minimize the cost of the resulting path.
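
A very simplified, per-trajectory sketch of the PI^2 update (Python/numpy):
sample noisy parameter variations, evaluate their rollout costs, and average
the noise with softmax (path-integral) weights. The real algorithm applies this
per time step along the trajectory with a proper cost normalization;
rollout_cost and all constants here are placeholders:

# Simplified PI^2-style update: cost-weighted averaging of exploration noise.
import numpy as np

def pi2_update(theta, rollout_cost, n_rollouts=10, noise_std=0.1, lam=1.0):
    eps = noise_std * np.random.randn(n_rollouts, theta.size)   # exploration noise
    costs = np.array([rollout_cost(theta + e) for e in eps])    # S(tau_k)
    s = (costs - costs.min()) / (costs.max() - costs.min() + 1e-9)
    weights = np.exp(-s / lam)
    weights /= weights.sum()                                    # P(tau_k)
    return theta + weights @ eps                                # weighted noise average

# toy example: minimize a quadratic "trajectory cost"
cost = lambda th: float(np.sum((th - 1.0) ** 2))
theta = np.zeros(5)
for _ in range(200):
    theta = pi2_update(theta, cost)
print(theta)       # approaches the optimum at 1.0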

3) LEARNING FORCE FEEDBACK CONTROL:

The policy is initialized from a user-provided kinematic demonstration. The
PI^2 reinforcement learning algorithm is then used to optimize the policy and
to obtain the right force/torque profile through trial and error. The steps
involved in improving the policy are:

 * Demonstration: forces and torques are initially set to zero, as they cannot
   be observed correctly during the kinematic demonstration. The robot is
   maximally compliant, i.e. it gives in easily to contact forces.
 * Cost function: an automated or user-provided cost function must be assigned
   in order to improve the policy.
 * Execution: PI^2 is a model-free reinforcement learning algorithm, so it
   simply optimizes the control parameters with respect to the cost function,
   treating the intermediate controller as a black box.
 * Rollout reuse: some rollouts from previous iterations are preserved so that
   learning continues from them. However, a preserved rollout may not be
   reproducible exactly, so such rollouts are re-evaluated to counteract this
   effect.

4) EXPERIMENT: The approach is tested on two manipulation tasks. Both tasks are
performed using a 7 degree-of-freedom (DOF) Barrett WAM arm, equipped with a
three-fingered Barrett Hand and a 6-DOF force-torque sensor at the wrist. The
control law for the 7-DOF arm (fig. 4 in the paper) is:

T_arm = T_inv.dyn + T_joint + T_force

* OPENING A DOOR: The aim of the experiment is to learn a control policy for
successfully operating a lever door handle and opening the door. The trajectory
is 10 seconds long and discretized into 100 steps. The immediate cost at time t
is:

r_t = 300 q_door + 100 q_handle + 100 q_pos + 10 q_orient + 0.1 q_fmag
      + 0.02 q_tmag + 0.02 q_ttrack + 0.01 q_ftrack + 0.0001 X^T R X

where q_door and q_handle are the squared tracking errors of the door and
handle angles respectively, q_pos and q_orient are the squared tracking errors
of the position and orientation of the hand, q_fmag and q_tmag are the squared
magnitudes of the desired forces and torques, q_ftrack and q_ttrack are the
squared force and torque tracking errors, and X^T R X is the control cost.

After 110 trials, the learned policy achieved 100% success.

Grasping a pen: The task is to pick up a pen lying on a table. This is harder,
as the pen might slip out of the hand. The immediate cost at time t is:

r_t = 100 q_pen + 1.0 q_ftrack + 0.5 q_fingertrack + 0.1 q_fmag + 0.0001 X^T R X

where q_pen is an indicator cost which is 1 if the pen has slipped out of the
hand (as described above), q_ftrack is the squared force tracking error,
q_fingertrack is the squared finger position tracking error, q_fmag is the
squared force magnitude, and X^T R X is the control cost.

After 90 trials, the learned policy achieved 100% success. Although uncertainty
in orientation was not taken into consideration, the policy could handle some
error in orientation as well.

(All figures are taken from the paper.)

---

Abstract: Developing robots capable of fine manipulation skills is of major
importance in order to build truly assistive robots. These robots need to be
compliant in their actuation and control in order to operate safely in human
environments. Manipulation tasks imply complex contact interactions with the
external world, and involve reasoning about the forces and torques to be
applied. Planning under contact conditions is usually impractical due to
computational complexity, and a lack of precise dynamics models of the
environment. We present an approach to acquiring manipulation skills on
compliant robots through reinforcement learning. The initial position control
policy for manipulation is initialized through kinesthetic
demonstration. This policy is augmented with a force/torque profile to be
controlled in combination with the position trajectories. The Policy
Improvement with Path Integrals (PI^2) algorithm is used to learn these
force/torque profiles by optimizing a cost function that measures task
success. We introduce a policy representation that ensures trajectory
smoothness during exploration and learning. Our approach is demonstrated on
the Barrett WAM robot arm equipped with a 6-DOF force/torque sensor on two
different manipulation tasks: opening a door with a lever door handle, and
picking up a pen off the table. We show that the learnt force control
policies allow successful, robust execution of the tasks.  

discussion+video on ICML site: 
http://icml.cc/discuss/2012/642.html
Video. http://www.youtube.com/watch?v=LkwQJ9_i6vQ.

REF: 
Theodorou, Evangelos; Buchli, Jonas; and Schaal, Stefan.
A generalized path integral control approach to reinforcement learning.
Journal of Machine Learning Research, pp. 3137–3181, 2010.

kalakris, pastorsa,sschaal @usc.edu
ludovic.righetti@a3.epfl.ch


[pdf]

}}



@article{sharon-stern-12_multi-agent-path-finding-conflict-based,
 title={Conflict-based search for optimal multi-agent path finding},
author={Guni Sharon and Roni Stern and Ariel Felner and Nathan Sturtevant},
journal={AAAI (to appear)},
year={2012}
annote = {

robots must move around in a space without hitting each other.

? project?

---review 2013 Sanchit Gupta ~sanchitg/cs365/hw2/10635_assignment2.pdf

--Objective--

This paper introduces a new search algorithm called Conflict Based Search (CBS)
to solve the multi-agent path finding (MAPF) problem. The idea is to examine
fewer states than A* while still solving MAPF optimally and efficiently.

--Problem Description--

In the MAPF problem, given a graph and unique start and goal positions for each
agent, the objective is to return a set of actions for each agent that moves it
from its start to its goal position without conflicting with other agents. An
action can be to wait at the current location or to move to a neighboring
location; each agent performs one action per time step. A conflict occurs when
two or more agents occupy the same location at the same time step. We want an
optimal solution, i.e. a set of actions that minimizes the cumulative cost
function: the sum over all agents of the number of time steps needed to reach
their respective goal locations. Both move and wait actions cost 1 unit.

--Description of the CBS algorithm--

CBS is a two level algorithm. At the high level, conflicting paths are found
and constraints are added. At the low level paths are updated so that they
are consistent with the new constraints.

High level: an OPEN set is maintained in which each node has 3 attributes:
cost, solution path set, and constraints. The solution path set is the set of k
paths, one per agent, each consistent with the constraints on that agent. The
cost is the sum of the single-agent path costs. The OPEN set is prioritized by
node cost. The high level removes the best (minimum-cost) node and checks
whether its solution path set is conflict-free by simulating the paths. If it
is, we are done and simply output this node's attributes; otherwise, on the
first conflict found, two new nodes are inserted into the OPEN set, each with
an additional constraint that avoids the conflict. The branching factor of the
high-level tree is therefore 2. To update the solution path set of the new
nodes, the low level of the algorithm is invoked. The process continues until
we obtain a node that satisfies all the constraints.

Low level: each agent is solved independently, i.e. ignoring the presence of
all other agents, while respecting the constraints imposed on its path. A* (or
any other optimal single-agent path finder) is invoked to search the underlying
graph for the optimal path of each agent that satisfies its constraints.
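
A compact, runnable sketch of this two-level structure on a toy graph (Python).
The low-level planner here is a time-expanded BFS instead of A*, only vertex
conflicts are checked (the paper also handles edge conflicts), and the graph,
agents and horizon below are invented for illustration:

# Two-level CBS sketch: high level branches on conflicts and adds constraints,
# low level replans a single agent under (vertex, time) constraints.
import heapq
from collections import deque

GRAPH = {"A": ["B"], "B": ["A", "C", "E"], "C": ["B", "D"], "D": ["C"], "E": ["B"]}
AGENTS = {1: ("A", "D"), 2: ("E", "A")}
HORIZON = 10

def low_level(agent, constraints):
    """Shortest time-expanded path avoiding constrained (vertex, time) pairs."""
    start, goal = AGENTS[agent]
    banned = {(v, t) for a, v, t in constraints if a == agent}
    queue, seen = deque([(start, 0, [start])]), {(start, 0)}
    while queue:
        v, t, path = queue.popleft()
        if v == goal:
            return path
        for nxt in GRAPH[v] + [v]:                       # move or wait
            state = (nxt, t + 1)
            if t + 1 <= HORIZON and state not in banned and state not in seen:
                seen.add(state)
                queue.append((nxt, t + 1, path + [nxt]))
    return None

def find_conflict(paths):
    """First (agent_a, agent_b, vertex, time) where two agents share a vertex."""
    horizon = max(len(p) for p in paths.values())
    for t in range(horizon):
        at = {}
        for a, p in paths.items():
            v = p[min(t, len(p) - 1)]                    # agents wait at their goal
            if v in at:
                return (at[v], a, v, t)
            at[v] = a
    return None

def cbs():
    paths = {a: low_level(a, []) for a in AGENTS}
    open_set, tie = [(sum(len(p) for p in paths.values()), 0, [], paths)], 1
    while open_set:
        cost, _, constraints, paths = heapq.heappop(open_set)   # best high-level node
        conflict = find_conflict(paths)
        if conflict is None:
            return paths                                         # conflict-free solution
        a1, a2, v, t = conflict
        for agent in (a1, a2):                                   # branch on the conflict
            new_cons = constraints + [(agent, v, t)]
            new_paths = dict(paths)
            new_paths[agent] = low_level(agent, new_cons)        # replan one agent only
            if new_paths[agent] is not None:
                heapq.heappush(open_set, (sum(len(p) for p in new_paths.values()),
                                          tie, new_cons, new_paths))
                tie += 1
    return None

print(cbs())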

--Contributions--

* The distinguishing feature of CBS is that it uses single agent search at
  the low level. This idea is a new one as previous approaches used to
  consider all the agents at once leading to examination of a fairly larger
  number of states.
* In earlier approaches to the MAPF problem, the search tree used to be
  exponential in the number of agents whereas in this approach search tree is
  exponential in the number of conflicts that are encountered during the
  solving process. This results in this algorithm behaving differently than
  other approaches in most cases.
* The performance of CBS depends on the structure of the input graph. In a
  graph with more bottlenecks CBS performs better than purely A* based
  approaches. Experimental results carried out show a speedup of a full order
  of magnitude over previous approaches in such graphs. On the other hand,
  graphs with more open spaces lead to its poor performance.

Complexity of the CBS algorithm

Let the cost of the optimal solution be C and the number of distinct locations
(vertices) in the input graph be V.

Then the total number of high-level nodes (nodes in the OPEN set) expanded by
CBS is O(2^(V·C)).

Therefore the total number of low-level nodes expanded is at most
O(2^(V·C) · V·C).

References
* Sharon, G.; Stern, R.; Felner, A. and Sturtevant, N. 2012. Conflict Based
  Search for Optimal Multi-Agent Path Finding. In AAAI, 563-569.
* Wikipedia : http://en.wikipedia.org/wiki/A*_search_algorithm

---

abstract:
In the multi agent path finding problem (MAPF) paths should
be found for several agents, each with a different start and
goal position such that agents do not collide. Previous optimal solvers
applied global A*-based searches. We present
a new search algorithm called Conflict Based Search (CBS).
CBS is a two-level algorithm. At the high level, a search is
performed on a tree based on conflicts between agents. At the
low level, a search is performed only for a single agent at a
time. In many cases this reformulation enables CBS to examine fewer states
than A* while still maintaining optimality.
We analyze CBS and show its benefits and drawbacks. Experimental results on
various problems shows a speedup of
up to a full order of magnitude over previous approaches.


[pdf]

}}

@inproceedings{slowinski-guerin-11_modeling-from-probability-clusters,
 title={Learning regions for building a world model from clusters in probability distributions},
author={Slowinski, W. and Guerin, F.},
booktitle={Development and Learning (ICDL), 2011 IEEE International Conference on},
volume={2},
pages={1--6},
year={2011},
annote = {

learns local region model in a map
unlike kuipers etal - whose algo requires bootstrapping data - is not learning
fully autonomously,
learns using state variables : attempts to learn rule-like generalizations
demonstrated on a 3D game environment.

####COGSCI project? 

---review 2013 rajiv k omar February 26, 2013 ~rajivk/cs365/hw2/Summary_60.tex	


We are given a developing agent that has only sensory inputs and is trying to
learn about the world and act accordingly. The first step in constructing an
internal model of the world is observing regularities in its sensory inputs.
When the model is in a continuous domain governed by a set of rules, the
agent's first step is to find intervals of interest within the continuous
variables, on which good rules with reliable predictions can be defined. In
this paper Slowinski and Guerin give an algorithm for finding such regions by
means of finding clusters in approximate probability distributions of the
continuous sensory variables. (Approximate probability distributions are needed
because the variables are continuous but memory is finite; they are stored in a
Compressible Distribution data structure, one per variable.)

The algorithm works by:
1) Finding possible causal relations between variables.
2) Creating an approximate probability distribution, using the Compressible
   Distribution data structure, for each variable.
3) Compressing each Compressible Distribution to produce clusters in the form
   of real-number intervals (see the sketch below).
4) Creating general rules.
5) Specialising rules.
6) Checking that better rules do not already exist as supersets of the
   generated rules.
(Steps 4, 5 and 6 are more technical.)
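
A minimal sketch of step 3 (Python/numpy), with a plain histogram standing in
for the paper's Compressible Distributions structure: clusters are taken to be
contiguous bins whose density exceeds a threshold, and each cluster becomes a
real-number interval. The threshold rule and the toy samples are assumptions
for illustration:

# Turn clusters of probability mass in a 1-D sensory variable into intervals.
import numpy as np

def clusters_to_intervals(samples, bins=50, density_threshold=0.5):
    hist, edges = np.histogram(samples, bins=bins, density=True)
    active = hist > density_threshold * hist.mean()
    intervals, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = edges[i]
        if not on and start is not None:
            intervals.append((start, edges[i]))
            start = None
    if start is not None:
        intervals.append((start, edges[-1]))
    return intervals

# toy sensory variable with two regimes (e.g. "standing still" vs "walking")
samples = np.concatenate([np.random.normal(0.0, 0.05, 1000),
                          np.random.normal(1.2, 0.10, 1000)])
print(clusters_to_intervals(samples))   # roughly one interval near 0, one near 1.2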

Although methods for this goal already exist, for example the landmark-based
method of Mugan and Kuipers, this is a new approach to the problem. The authors
observe that the cluster-based algorithm places boundaries at more natural
places than the landmark algorithm. The cluster method gives rules providing a
clearer general model, while the landmark method makes a fine-grained model
more suitable for planning. They also speculate that the cluster method could
become the basis for learning higher-order rules, since the generated regions
seem closer to the rules used to generate the test model.

The method could be improved further by complementing it with other
rule-manipulation techniques to produce a fine-grained model. Performance
should also improve with online learning in a fully developmental agent,
instead of the offline learning used here. Finally, the cluster-based algorithm
could be generalized to 2- or 3-dimensional vector variables.

---

ABSTRACT—A developing agent learns a model of the world by observing
regularities occurring in its sensory inputs. In a continuous domain where
the model is represented by a set of rules, a significant part of the task of
learning such a model is to find appropriate intervals within the continuous
state variables, such that these intervals can be used to define rules whose
predictions are reliable. We propose a technique to find such intervals (or
regions) by means of finding clusters on approximate probability
distributions of sensory variables. We compare this cluster-based method with
an alternative landmark-based algorithm. We evaluate both techniques on a
data log recorded in a simulation based on OpenArena, a three-dimensional
first-person-perspective computer game, and demonstrate the results of how
the techniques can learn rules which describe walking behaviour. While both
techniques work reasonably well, the clustering approach seems to give more
“natural” regions which correspond more closely to what a human would expect;
we speculate that such regions should be more useful if they are to form a
basis for further learning of higher order rules.


[pdf]

}}

	

@inproceedings{natarajan-joshi-tadepalli-11ijc_imitation-learning-boosting,
title={Imitation learning in relational domains: A functional-gradient boosting approach},
author={Natarajan, S. and Joshi, S. and Tadepalli, P. and Kersting, K. and Shavlik, J.},
booktitle={Proceedings of the Twenty-Second international joint conference on Artificial Intelligence-Volume Volume Two},
pages={1414--1420},
year={2011},
annote = {

####

---review 2013 Ritesh Gautam ~gritesh/cs365/hw2.html

The authors use the method of functional-gradient boosting by Friedman, which
is similar to the AdaBoost algorithm of Freund and Schapire, for supervised
learning in relational domains.

--Tree Boosting for Relational Imitation learning (TBRIL)--

The goal of the algorithm is to produce a policy for the learning model, taking
as input the set of states (S), the set of actions (A) and the set of features
(f) that describe the relation between the states and the action to be taken at
any time; a policy is a mapping from features to actions, given the time step i
and the trajectory j.

A training set provides the algorithm with a policy to learn, which it can
later reproduce for other instances of the features.

The algorithm has a function TBoost that iterates over the actions, generating
examples for each action using the GenExamples function. The generated examples
are given to the regression-tree learner to compute the value of the current
model at each step. Finally, after all the gradient steps, we obtain the
updated value Λ for each action.
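
A propositional sketch of the boosting loop (Python with scikit-learn): each
action keeps an additive potential, the policy is the softmax of the
potentials, and every step fits a regression tree to the functional gradient
I[a = demonstrated action] - P(a|s). The paper learns relational regression
trees over relational examples; plain sklearn trees on feature vectors merely
stand in here:

# Functional-gradient boosting of a softmax policy with regression trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_policy(states, demo_actions, n_actions, n_steps=20, lr=0.5, depth=2):
    trees = [[] for _ in range(n_actions)]           # additive model per action

    def potentials(X):
        psi = np.zeros((len(X), n_actions))
        for a in range(n_actions):
            for t in trees[a]:
                psi[:, a] += lr * t.predict(X)
        return psi

    for _ in range(n_steps):
        psi = potentials(states)
        probs = np.exp(psi - psi.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)    # softmax policy P(a|s)
        for a in range(n_actions):
            grad = (demo_actions == a).astype(float) - probs[:, a]   # functional gradient
            trees[a].append(DecisionTreeRegressor(max_depth=depth).fit(states, grad))
    return lambda X: potentials(X).argmax(axis=1)    # greedy learned policy

# toy demonstrations: the teacher takes action 1 whenever feature 0 is positive
X = np.random.randn(200, 3)
a = (X[:, 0] > 0).astype(int)
policy = boosted_policy(X, a, n_actions=2)
print((policy(X) == a).mean())                       # should be close to 1.0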

---

abstract:

Imitation learning refers to the problem of learning how to behave by
observing a teacher in action.  We consider imitation learning in relational
domains, in which there is a varying number of objects and relations among
them. In prior work, simple relational policies are learned by viewing
imitation learning as supervised learning of a function from states to
actions. For propositional worlds, functional gradient methods have been
proved to be beneficial. They are simpler to implement than most existing
methods, more efficient, more naturally satisfy common constraints on the
cost function, and better represent our prior beliefs about the form of the
function. Building on recent generalizations of functional gradient boosting
to relational representations, we implement a functional gradient boosting
approach to imitation learning in relational domains. In particular, given a
set of traces from the human teacher, our system learns a policy in the form
of a set of relational regression trees that additively approximate the
functional gradients.  The use of multiple additive trees combined with
relational representation allows for learning more expressive policies than
what has been done before. We demonstrate the usefulness of our approach in
several different domains.

[pdf]

}}



@inproceedings{kober-oztop-11ijc_reinforcement-learning-for-robot-movements,
 title={Reinforcement learning to adjust robot movements to new situations},
author={Jens Kober and Erhan Oztop and Jan Peters},
booktitle={Proceedings of the 22nd IJCAI Volume Three},
pages={2650--2655},
year={2011},
annote = {

In training a robot to throw darts (or hit a table-tennis ball), the system
learns a function from its state space (initial pose of robot, target
location) to the motion parameter space (velocity, angle).  This map helps it
generalize dynamic plans from one set of parameters to similar others.
Cost-regularized Kernel Regression finds solutions with the same final
performance two orders of magnitude faster than the finite difference
gradient (FD) approach and twice as fast as the reward-weighted regression.
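
A minimal sketch of cost-regularized kernel regression in this spirit, assuming
a Gaussian kernel and a kernel-ridge-style solution; the paper's exact
formulation and hyperparameters may differ.

# Cost-regularized Kernel Regression (CrKR) sketch: low-cost samples dominate,
# high-cost samples are regularized away.
import numpy as np

def gauss_kernel(A, B, bw=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

def crkr_fit(states, meta_params, costs, lam=0.5, bw=1.0):
    """states: (n, ds); meta_params: (n, dm); costs: (n,), lower is better."""
    K = gauss_kernel(states, states, bw)
    C = np.diag(costs)                         # per-sample cost as regularizer
    W = np.linalg.solve(K + lam * C, meta_params)
    return states, W, bw

def crkr_predict(model, new_state):
    states, W, bw = model
    k = gauss_kernel(new_state[None, :], states, bw)   # (1, n) kernel vector
    return (k @ W)[0]                          # predicted meta-parameters (e.g. velocity, angle)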

--- review 2013 Pratap Bhanu Solanki ~prabhanu/cs365/hw2/paper_review.pdf

INTRODUCTION: This paper focuses on modulating the elementary movements of a
robot's actuators so that it can learn to adjust to new circumstances which
are slightly different from the old training cases, without having to learn
the motor parameters again. The authors introduce a reinforcement learning
algorithm based on a kernelized version of reward-weighted regression.

Methodology:

1. Turning to Cost-regularized Kernel Regression (CrKR): by inserting the
reward-weighted regression (RWR) solution into the system's output equation
over basis functions, reward-weighted regression is turned into
Cost-Regularized Kernel Regression.

2. The rewards are assumed to be positive and are transformed into costs by
an inverse relation.

Now the cost is directly proportional to the distance from the desired
optimal solution at a point.

3. Meta-parameter learning by Reinforcement learning: First of all this
algorithm receives three inputs :

a. Motor primitives (of Each degree of freedom)

b. Initial example containing system states, meta-parameter and Cost

c. Scaling parameter

To illustrate the algorithm, an example 2D dart-throwing task is used. Here
the state 's' is the height of the impact point, and the meta-parameters are
the velocity and angle of the dart as it leaves the hand.

Figure: 2D Dart throwing task(Courtesy [1])

The cost here is the error between the desired goal and the impact point,
plus the launching velocity of the dart. Initially, meta-parameters are
sampled given the current state; the cost is then computed from the outcome
of the trial and the policy is updated accordingly. The same process is
followed in each later iteration.

The dart throwing is tested with a simulated robotic arm and also with a real
Barrett WAM.

Results evaluation: in a simple cannon-shooting task, they benchmark the
CrKR-based reinforcement learning approach against a finite-difference
gradient estimator and reward-weighted regression.

For the complete real-robot framework, a more complex dart-throwing task
called 'Around the Clock' was used. Here too CrKR outperformed RWR, as can be
observed in figure 2.

Figure: 2 Cost function for Dart throwing Task (Courtesy [1])

Table tennis: the complete framework has also been tested on hitting a
table-tennis ball in the air. The meta-parameters are the joint positions and
velocities for all seven degrees of freedom plus a timing parameter. These
15 parameters are learned and optimized using a Gaussian kernel. At the
beginning the robot misses 95% of the balls, but eventually it hits almost
all balls (figure 3).

Figure 3: Cost function of Table Tennis task (Courtesy [1])

Conclusion: It can be seen from the results that the mapping needed to handle
a new situation can be learned using CrKR. This algorithm outperforms RWR and
the other preceding methods.

Reference:

1. Reinforcement Learning to Adjust Robot Movements to New Situations.
Kober, Oztop, Peters. Proceedings of the 22nd IJCAI, Volume Three,
2650--2655, 2011. (The given paper)


---

Abstract:
Many complex robot motor skills can be represented using elementary
movements, and there exist efficient techniques for learning parametrized
motor plans using demonstrations and self-improvement. However with current
techniques, in many cases, the robot currently needs to learn a new
elementary movement even if a parametrized motor plan exists that covers a
related situation. A method is needed that modulates the elementary movement
through the meta-parameters of its representation. In this paper, we describe
how to learn such mappings from circumstances to meta-parameters using
reinforcement learning. In particular we use a kernelized version of the
reward-weighted regression. We show two robot applications of the presented
setup in robotic domains; the generalization of throwing movements in darts,
and of hitting movements in table tennis. We demonstrate that both tasks can
be learned successfully using simulated and real robots.


[pdf]

}}



@inproceedings{erdogan-veloso-11ijc_action-selection-robocup,
title={Action Selection via Learning Behavior Patterns in Multi-Robot Systems},
author={Can Erdogan and Manuela Veloso},
booktitle={IJCAI Proceedings-International Joint Conference on Artificial Intelligence},
volume={22},
number={1},
pages={192},
year={2011},
abstract = {

---review 2013 sugam anand

This paper presents a technique that tries to capture the strategies that
influence the behaviour patterns of teams in the RoboCup Small Size League.
It also presents an algorithm to select actions to counter the opposition's
moves during an episode. The paper has two main parts: first, the authors
formalise a model that represents the spatial and temporal data of the robots
as trajectories in a 2-D image; second, they present a notion of trajectory
similarity. They model the behaviour of the opposition by implementing a
variant of agglomerative hierarchical clustering. On the basis of this
clustering, the system classifies behaviour as it occurs in real time, and an
algorithm autonomously generates counter moves. The authors implemented their
research on CMDragons, the SSL team of CMU, and the results were outstanding.

Note: they make a non-responsive assumption, which means that the opposition
will not try to adapt its strategy (in response to ours) while we are
analysing its behaviour patterns, and will only execute a preplanned
strategy.

Some Definitions

Behaviour pattern: a cluster of similar sets of trajectories, obtained in
real time or from a database of previous matches.

Trajectory: a vector of points recorded within the time frame t0 to tn.

Episode: E[t0,tn] is a time frame during which they want to analyse the
behaviour of the opposition and hence deduce its strategies. It begins at t0,
when a free or indirect kick is awarded by the referee box, and ends at the
stop command at tn.

Referee calls: F: free kick, I: indirect kick, p: penalty, k: kickoff,
S: start, s: stop.

Active agent: corresponds to those robots which either handle the ball at the
start of an episode or move into the opposite half to position themselves to
receive a pass.

Trajectory Analysis

The trajectories defined above are represented as 2-D images, so
shape-matching algorithms can give a notion of similarity between sets of
trajectories. The shape-matching approach used is based on the Hausdorff
distance metric.

Consider two sets of trajectories Sx and Sy over time frames Tx and Ty. To
measure the similarity between them, the Hausdorff distance is computed, with
E(p,q) being the Euclidean distance between points p and q. The authors then
formulate their similarity function from this distance, taking optimisation
and inversion into account.
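
A generic sketch of the symmetric Hausdorff distance between two point sets
follows; the paper's final similarity function adds optimisation and inversion
steps that are not reproduced here.

# Hausdorff distance between two sets of trajectory points.
import numpy as np

def directed_hausdorff(P, Q):
    """max over p in P of the distance from p to its nearest neighbour in Q."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)   # pairwise E(p, q)
    return d.min(axis=1).max()

def hausdorff(P, Q):
    return max(directed_hausdorff(P, Q), directed_hausdorff(Q, P))

# Example: flatten each set of trajectories into an (n, 2) array of field points.
Sx = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])
Sy = np.array([[0.1, 0.0], [1.1, 0.6], [2.2, 0.9]])
print(hausdorff(Sx, Sy))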

Clustering Techniques

With the help of the similarity function, they partition the sets of
trajectories into clusters, each representing a behaviour. Each cluster is
also mapped to an appropriate action. They further modify the similarity
function to classify a set of trajectories that starts at t0 and ends at
tk > tn: because it lies across two time frames for which clusters exist,
there is a question of whether to classify it into ci or ci+1.

To measure the distance between a set of trajectories S and a cluster ci of
clustering C, they compute a cluster distance and classify S into some
cluster on that basis. Moreover, to compare two sets of trajectories
stretching over different time frames, one complete and one incomplete, they
apply a heuristic and a modified similarity function.

[red] is a function which shortens the complete trajectory so that it can be
compared to the short one. Because of the small dataset they implemented a
variant of hierarchical clustering, since outliers can significantly affect
the partitioning; density-based clustering also cannot cluster such data, as
it shows large variance in densities.

Classification in Realtime And Selecting Actions

The paper concludes with an explanation of classification during a game and
an algorithm to select optimal actions. The SelectAction algorithm takes the
current state of the game as input, extracts the behaviour pattern from the
state, classifies the pattern into a cluster, and executes the action
associated with that cluster if it is applicable. Functions used:

getObservedPattern : Returns the behaviour pattern observed during an
episode.

onlineClassifier : Classifies the behaviour pattern into a cluster using the
existing database.

getOptimalAction : Returns the most appropriate action which could be
performed in that scenario.

Applicable : Returns the feasibility of the returned action, i.e., whether it
can be executed before it becomes outdated.

With the non-responsive assumption they were able to get a success rate of
70-80%. Here success is defined as a successful interception of an opposition
pass.

Comments

They assumed that the opposing team will not modify its strategies during
gameplay, which is a fair assumption, but forcing matches between teams
running the same algorithm might lead to a "chicken and egg" problem.

Both teams would try to adjust their passes based on the history, and the
passing team would know which moves were intercepted in previous scenarios,
so both would react in unpredictable ways.


---

The RoboCup robot soccer Small Size League has been running since 1997 with
many teams successfully competing and very effectively playing the
games. Teams of five robots, with a combined autonomous centralized
perception and control, and distributed actuation, move at high speeds in the
field space, actuating a golf ball by passing and shooting it to aim at
scoring goals. Most teams run their own pre-defined team strategies, unknown
to the other teams, with flexible game-state dependent assignment of robot
roles and positioning. However, in this fast-paced noisy real robot league,
recognizing the opponent team strategies and accordingly adapting one's own
play has proven to be a considerable challenge. In this work, we analyze
logged data of real games gathered by the CMDragons team, and contribute
several results in learning and responding to opponent strategies. We define
episodes as segments of interest in the logged data, and introduce a
representation that captures the spatial and temporal data of the multi-robot
system as instances of geometrical trajectory curves. We then learn a model
of the team strategies through a variant of agglomerative hierarchical
clustering. Using the learned cluster model, we are able to classify a team
behavior incrementally as it occurs. Finally, we define an algorithm that
autonomously generates counter tactics, in a simulation based on the real
logs, showing that it can recognize and respond to opponent strategies.


[pdf]

}}



@inproceedings{macalpine-stone-12_omnidirectional-humanoid-robocup-winner,
 title={Design and optimization of an omnidirectional humanoid walk: A winning approach at the RoboCup 2011 3D simulation competition},
author={Patrick MacAlpine and Samuel Barrett and Daniel Urieli and Victor Vu and Peter Stone},
booktitle={Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12)},
year={2012},
annote = {

---review 2013 harsh vardhan reddy ~gujjula/cs365/hw2.html

1 Introduction
2 Robo Soccer Domain Description
3 Walk Engine
3.1 Calculating the position of the feet w.r.t. the torso
4 Control of Walk Engine
5 Optimization of parameters of Walk Engine
5.1 Drive the ball to goal optimization
5.1.1 Go to Target parameter set
5.1.2 Sprint parameter set
5.1.3 Positioning parameter set
6 Results

1 Introduction

We discuss the learning of a humanoid walk in the robot soccer domain. The
agent's behaviour is divided into simpler subtasks and the parameters for
these subtasks are then optimized. The simulation environment may not
represent the real world exactly, but it can test tasks such as humanoid
locomotion. One main advantage of working in simulation is that a lot of
experimentation can be done without any wear and tear on physical devices.
In later sections we describe the parameterized omnidirectional walk engine.
The stability and speed of the robot's walk (step height, length, timing and
so on) can be varied through these parameters.

It is not possible to learn parameters for all possible walking directions,
so the robot learns parameters for the simple subtasks needed to play soccer:

Walking to the target.
Dribbling a soccer ball.

Higher-level behaviours, such as dribbling the ball to the goal, can be
obtained by combining the smaller subtasks. Later sections discuss how to
split the higher-level task (omnidirectional humanoid locomotion) into
subtasks and how to optimize the parameters for those subtasks.

2 Robo Soccer Domain Description

A team of 9 autonomous humanoid robots plays soccer in a physically realistic
environment. The RoboCup 3D simulation environment is supported by SimSpark
(a physical multiagent system simulator), which uses the Open Dynamics Engine
(ODE) library for realistic physics in the simulator. The agents are modelled
on the Aldebaran Nao robot, with a height of 57 cm, a mass of 4.5 kg, and 22
degrees of freedom: 6 in each leg, 4 in each arm and 2 in the neck.

The interaction between agent and simulator is achieved by sending torque
commands and receiving perceptual information. The agent is equipped with
joint perceptors and effectors to control its hinge joints. Joint perceptors
give noise-free angular measurements to the agent every simulation cycle
(20 ms). Joint effectors allow the agent to move a joint by specifying torque
and direction. Every third simulation cycle (60 ms), visual data of the
environment is provided to the agent as noisy measurements of the distance
and angle to objects within a restricted 120-degree field of view.
Communication between agents happens every other simulation cycle (40 ms)
with messages of 20 bytes.

                 
            Figure 1: Nao humanoid robot (left). Figure 2: a view of the soccer field (right)

3 Walk Engine

An omnidirectional walk is crucial for requesting continuous velocities in all directions, so the robot can approach frequently changing destinations (such as the ball) quickly and smoothly. The UT Austin Villa 2011 team first tested the walk on physical Nao robots before moving it to simulation for the RoboCup 3D simulation league, which led to some important design features:

If the robot is unstable, decreasing the step size helps it regain balance.
There is a significant delay between commands and sensed changes.
The walk engine is based on that of the real Nao robot (Graf et al. 2009), with some variations:
a sigmoid function for the forward component
proportional control to adjust the step size
optimized parameters
limb motion based on sinusoidal functions with feedback control
The walk engine provides the trajectory for the body, from which the positions of the feet w.r.t. the body can be calculated.
            x - forward direction             z - vertical direction
            y - sideways direction            θ - rotation around the z-axis
The trajectory is calculated using a double linear inverted pendulum model in which the center of mass swings over the stance foot.
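
For reference, the center-of-mass motion of a single linear inverted pendulum
has a simple closed form; the paper's double linear inverted pendulum differs,
so the following is only an illustrative sketch assuming a constant CoM height.

# Linear inverted pendulum CoM motion about the stance foot (at the origin).
import numpy as np

def lipm_com(x0, v0, z0, t, g=9.81):
    """CoM position and velocity at time t, starting at x0 with velocity v0."""
    Tc = np.sqrt(z0 / g)                                   # pendulum time constant
    x = x0 * np.cosh(t / Tc) + Tc * v0 * np.sinh(t / Tc)
    v = (x0 / Tc) * np.sinh(t / Tc) + v0 * np.cosh(t / Tc)
    return x, v

# Example: CoM 5 cm behind the stance foot, moving forward at 0.3 m/s, height 0.26 m.
print(lipm_com(-0.05, 0.3, 0.26, t=0.2))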

               
            Figure 3: Workflow for generating joint commands from the walk engine

3.1 Calculating the position of the feet w.r.t. the torso:

[...]

4 Control of Walk Engine

Agent moving towards a set target point in the field:

The approach is to move and turn towards the target at the same time:

	An agent that turns to face the target was found to beat an otherwise
	identical agent that does not, by 0.7 goals on average across 100 games.

	An agent that moves immediately towards the target beats a version that
	first turns to the orientation of maximum forward velocity, by 0.3 goals
	on average across 100 games.

Dribbling the ball:
	The agent first turns to get behind the ball before running towards it,
	and then pushes the ball in the desired dribble direction.

5 Optimization of parameters of Walk Engine

As noted earlier, the walk engine has more than 40 parameters. The initial
agent uses a walk whose parameters are set based on the authors'
understanding of the system. Because these initial parameter values give a
slow walk, the parameters are optimized using the Covariance Matrix
Adaptation Evolution Strategy (CMA-ES) algorithm. CMA-ES generates
candidates, each of which is evaluated with respect to a fitness measure; the
mean of the multivariate Gaussian distribution over candidates is then
re-estimated as the weighted average of the candidates with the highest
fitness. Since optimizing all 40+ parameters is not feasible, 14 parameters
are optimized while the others are held fixed. These 14 parameters are chosen
for their high impact on the speed and stability of the robot and are listed
in a table in the paper.
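
A simplified sketch of this kind of evolution-strategy loop is shown below. It
adapts only the mean and a global step size, not the full covariance matrix
that CMA-ES maintains, and walk_fitness() is a hypothetical evaluation of a
candidate parameter set on an optimization task such as goToTarget.

# Simplified (mu, lambda) evolution-strategy loop in the spirit of CMA-ES.
import numpy as np

def optimize_walk(initial_params, walk_fitness, n_gens=50, pop=20, elite=5, sigma=0.1):
    mean = np.asarray(initial_params, dtype=float)
    for gen in range(n_gens):
        # sample candidate parameter sets from a multivariate Gaussian
        candidates = mean + sigma * np.random.randn(pop, mean.size)
        scores = np.array([walk_fitness(c) for c in candidates])
        best = candidates[np.argsort(scores)[-elite:]]      # highest-fitness candidates
        # re-estimate the mean as a weighted average of the elite candidates
        weights = np.arange(1, elite + 1, dtype=float)
        weights /= weights.sum()
        mean = (weights[:, None] * best).sum(axis=0)
    return mean

# Toy usage with a dummy fitness (maximized at the zero vector):
# print(optimize_walk(np.ones(14), lambda p: -np.sum(p ** 2)))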

We break the agent's game behaviour into set of subtasks and optimize the
parameters for each task.

5.1 Drive the ball to goal optimization

The starting task is driveBallToGoal, where the robot must drive a ball,
placed on the field, as far as it can towards the goal within 30 simulated
cycles. The fitness of a parameter set is the distance the ball travels
towards the goal. The optimized agent, referred to as DriveBallToGoal,
improves over the initial agent by a factor of 15 and beat it by an average
of 5.54 goals with a standard error of 0.14 across 100 games. While
optimizing the parameters, the authors found that the agent was unstable when
stopping at a target or circling around the ball to dribble; driveBallToGoal
does not represent these situations of a simulated game.

5.1.1 Go To Target parameter set

We replace the driveBallToGoal optimization with a new goToTarget subtask. In
this task there are many targets, differing from one another, and each target
is active on the field one at a time for a period of time. The agent is
rewarded based on the distance covered towards the active target, with an
extra reward for the distance it could have covered in the remaining time if
the target is reached early. There is also a penalty for the distance
travelled during designated stop targets. In the following equations, Fall is
5 if the robot falls and 0 otherwise:

            d_target - distance travelled toward the target
            d_moved  - total distance moved
            t_total  - total duration a target is active
            t_taken  - time to reach the target (t_total if not reached)
            reward_target = d_target * (t_total / t_taken) - Fall
            reward_stop   = -d_moved - Fall

Quick changes of target and direction train the reaction speed, long-duration
targets improve the straight-line speed, and stop targets train the agent to
stop quickly while remaining stable. The trajectories that the agent follows:

long walks forward, backward, left and right
curvilinear walk
quick changes of direction
stopping and then moving forward, backward, left and right
switching from left-to-right movement to right-to-left movement
changing targets very fast to simulate a noisy target
weaving back and forth at 45-degree angles
extreme direction changes to test stability
fast movements combined with stopping
quickly alternating between walking left and right
coiled walk, both clockwise and counter-clockwise

The GoToTarget agent won over the DriveBallToGoal agent by an average of 2.04 goals with a standard error of 0.11 across 100 games.

5.1.2 Sprint parameter set

The forward speed of the robot is improved by optimizing a parameter set for
walking straight forward for 10 seconds. The agent learned parameters for
walking faster than with the goToTarget parameter set, but when the robot
switched between the forward-walk and goToTarget parameter sets it became
unstable and fell over; the parameter sets, learned separately, were
incompatible. The solution is to run the goToTarget subtask optimization
again, this time keeping the goToTarget parameter set fixed and learning a
new parameter set (the sprint parameter set), which the agent uses when its
orientation is within about 15 degrees of the target. The output of the
goToTarget optimization is given as the input for the sprint optimization.
Learned this way, the new Sprint agent was stable when switching between the
two parameter sets and its speed also increased. The Sprint agent won over
the GoToTarget agent by an average of 0.09 goals with a standard error of
0.07.

5.1.3 Positioning parameter set

When positioning to dribble the ball the agent was still a little slow;
positioning around the ball involves more side-stepping to circle the ball. A
third parameter set (the positioning parameter set) is therefore learned,
which requires a new driveBallToGoal optimization in which evaluation is
based on the distance the robot dribbles the ball over 15 seconds. The
positioning parameter set is used when the agent is within 0.8 meters of the
ball, and it is seeded from the values of the goToTarget parameter set. The
goToTarget and sprint parameter sets are held fixed, so the optimization
includes transitions between all three parameter sets. In this third layer of
learning, the output of the previously learned parameter sets is given as
input to the positioning optimization. The final agent won over the Sprint
agent by an average of 0.15 goals with a standard error of 0.07 across 100
games.
             
Figure 5: UT Austin Villa walk parameter optimization progression. Circles
represent parameter sets used by the agent during the optimization
progression, while the arrows and associated labels above them indicate the
optimization tasks used in learning. Parameter sets: I = initial,
T = goToTarget, S = sprint, P = positioning.

6 Results

	GoToTarget > DriveBallToGoal > Initial
                           
           Figure 6: Game results of agents with different walk parameter sets.
           Entries show the average goal difference (row - column) from 100 ten-minute games. Values in parentheses are the standard error.


[Beats the winner of RoboCup 2011, Apollo3D, by 1.45 goals on average over
100 games]

           Figure 7: Full game results, averaged over 100 games. Each row is an agent from the RoboCup 2011
           competition, with its rank. Entries show the goal difference from 10-minute games versus the final optimized agent.
           Values in parentheses are the standard error.

Sources

Figures 1 and 2 are taken from
http://en.wikipedia.org/wiki/RoboCup_3D_Soccer_Simulation_League
The remaining figures and the formulae are from the research paper.

---

abstract:
presents the design and learning architecture for an omnidirectional walk
used by a humanoid robot soccer agent acting in the RoboCup 3D simulation
environment. The walk, which was originally designed for and tested on an
actual Nao robot before being employed in the 2011 RoboCup 3D simulation
competition, was the crucial component in the UT Austin Villa team winning the
competition in 2011. To the best of our knowledge, this is the first time
that robot behavior has been conceived and constructed on a real robot for
the end purpose of being used in simulation. The walk is based on a double
linear inverted pendulum model, and multiple sets of its parameters are
optimized via a novel framework. The framework optimizes parameters for
different tasks in conjunction with one another, a little-understood
problem with substantial practical significance. Detailed experiments show
that the UT Austin Villa agent significantly outperforms all the other agents
in the competition with the optimized walk being the key to its success.


[pdf]

}}



@article{hawes-klenk-12_spatial-regions-based-on-context,
title={Towards a Cognitive System That Can Recognize Spatial Regions Based on Context},
author={Nick Hawes and Matthew Klenk and Kate Lockwood and Graham S. Horn and J. D. Kelleher},
year={2012},
booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
annote = {

A phrase such as "the front of the classroom" is determined not by geometry
per se, but in terms of the objects present in the room (chairs, a desk, a
whiteboard), their role in this context (seats for students to watch a
teacher who writes on the whiteboard) and their configuration in space (the
seats point toward the whiteboard). Such regions are named
context-dependent spatial regions (CDSRs), and an attempt is made to
understand them with a mobile robot.

####

---review 2013 Ayush Varshney (10183)  February 26, 2013 

Synopsis

The paper describes a cognitive system that can recognize spatial regions
based on the context they are presented to the system. To understand the
problem take this example:

Command: Find the position for the referee in a stadium

The solution depends on the game being played in the stadium, which in itself
can be determined by the relative positions of the players, ball, etc. Such
areas are referred to as Context-Dependent Spatial Regions (CDSRs) in this
paper. The cognitive system described in this paper can learn CDSRs by using
semantic labels, Qualitative Spatial Representations (QSRs) and analogy.

Introduction

The paper specifically describes the problem of making a robot find the front
of a classroom based on the configuration of objects such as tables and
chairs present in the room ('the context'). The work is based on Dora, a
mobile cognitive robot with a pre-existing multi-layered spatial model [1].

Dora uses vision to recognize pre-trained 3D object models and builds a
metric map, which is basically a set of lines in a 2D global coordinate
frame.

The strengths of 8 spatial relations between an object detected by Dora and
each of the objects adjacent to it (adjacency being determined using a
Voronoi diagram [2]) are calculated. For this, the spatial relationship (s)
between adjacent objects is calculated as:

[EQN]

where

d: the distance between the landmark's centroid and the adjacent object's
location

θ: the inner angle between the direction vector of the relation and the
vector from the origin (the landmark's centroid) to the neighbour's location.

This is the basis of QSR extraction.
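
Since the exact equation is not reproduced in this summary, the following is a
purely hypothetical illustration of how strengths for 8 directional relations
might be scored from the distance d and angle θ described above; it is not the
paper's formula.

# Hypothetical scoring of 8 directional spatial relations from d and theta.
import numpy as np

DIRECTIONS = {name: np.deg2rad(a) for name, a in
              zip(["E", "NE", "N", "NW", "W", "SW", "S", "SE"], range(0, 360, 45))}

def relation_strengths(landmark_xy, neighbour_xy):
    v = np.asarray(neighbour_xy, float) - np.asarray(landmark_xy, float)
    d = np.linalg.norm(v)                         # distance between centroids
    bearing = np.arctan2(v[1], v[0])
    strengths = {}
    for name, angle in DIRECTIONS.items():
        # inner angle between the relation's direction vector and the neighbour vector
        theta = np.abs(np.arctan2(np.sin(bearing - angle), np.cos(bearing - angle)))
        # closer alignment (small theta) and smaller distance give higher strength
        strengths[name] = max(0.0, np.cos(theta)) / (1.0 + d)
    return strengths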

Methods Used

• Representing CDSRs

Anchor points are used to represent CDSRs [3]. These are symbolic
descriptions which link a conceptual entity to a perceived entity. The
perceived entities are the room and the objects recognized by Dora. Anchor
points are created from the perceived entities using unary functions such as
YMinXFewestFn. Example of a CDSR representation:

* Using Analogy

To save the user from the pain of training for all the contexts the system
uses analogy. For example if the Bot is trained where the front of a
classroom after perceiving a class and objects in it, it is capable of
recognizing the front – of any other class of the same structure.
Structure-Mapping Engine [4] is employed to perform analogical matching in
the system.

Example System Run

• Setup
The approach was tested on six classrooms and two simulated studio
apartments. The base and target representations were created by making Dora
wander around two different classrooms while a map was created. QSRs are then
generated using the map and object data.

• Evaluation [5]
These results support the hypothesis that anchor points can provide a
symbolic representation on top of sensor data for context-dependent spatial
regions, and, when combined with qualitative spatial relations, they
facilitate learning from a single example through analogical transfer.

References
[1] M.H.C. Gretton; M. Gobelbecker. "Dora, a Robot Exploiting Probabilistic Knowledge under Uncertain Sensing for Efficient Object Search"
[2] Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, 2nd ed. Wiley, ISBN 0-471-98635-6, pages 441-446
[3] Klenk, M.; Hawes, N.; Lockwood, K.; Horn, G.S.; Kelleher, J.D. "Using Anchor Points to Define and Transfer Spatial Regions Based on Context"
[4] Falkenhainer, B.; Forbus, K.D.; Gentner, D. "The Structure-Mapping Engine: Algorithm and Examples"
[5] Klenk, M.; Hawes, N.; Lockwood, K.; Horn, G.S.; Kelleher, J.D. "Towards a Cognitive System That Can Recognize Spatial Regions Based on Context"

abstract: 
In order to collaborate with people in the real world, cognitive systems must
be able to represent and reason about spatial regions in human
environments. Consider the command “go to the front of the classroom”. The
spatial region mentioned (the front of the classroom) is not perceivable
using geometry alone. Instead it is defined by its functional use, implied by
nearby objects and their configuration. In this paper, we define such areas
as context-dependent spatial regions and present a cognitive system able to
learn them by combining qualitative spatial representations, semantic labels,
and analogy. The system is capable of generating a collection of qualitative
spatial representations describing the configuration of the entities it
perceives in the world. It can then be taught context-dependent spatial
regions using anchor points defined on these representations. From this we
then demonstrate how an existing computational model of analogy can be used
to detect context-dependent spatial regions in previously unseen rooms. To
evaluate this process we compare detected regions to annotations made on maps
of real rooms by human volunteers.

[pdf]

}}


ATTENTION


  

@inproceedings{jiang-crookes-12_visual-saliency-estimation-via-manifold-learning,
 title={Visual Saliency Estimation through Manifold Learning},
author={Richard M. Jiang and Danny Crookes},
booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
year={2012}

---review 2013 Salman Mohammad ~salmanmd/cs365/hw2/salman_AI.pdf

Visual salience (or visual saliency) is the distinct subjective perceptual
quality which makes some items in the world stand out from their neighbors
and immediately grab our attention. This paper presents a better algorithm
for detecting salient visual features in complex images than the previously
available methods.

--OBJECTIVES/MOTIVATION--

 * Most early work makes more effort to build saliency models on low-level
	image features based on local contrast. These methods investigate the
	rarity of image regions with respect to (small) local neighborhoods.
 * Recent efforts have been made toward global contrast based saliency
	estimation, where saliency of an image region is evaluated at the
	global scale with respect to the entire image.
 * From previous research it is known that producing a correct salient map is
	sensitive to the size of regions, and a manual fine tuning of the
	segmentation is a prerequisite for some challenging images.

In summary, it has been observed that most global approaches depend on either
regions from image segmentation, or blocks with specified sizes.

While image structure at the global scale is usually an important factor in
producing salient stimuli, this paper presents a novel approach to saliency
modeling, namely manifold-based saliency estimation.

--NEW ALGORITHM--

In the proposed approach, an image is considered as a manifold of visual
signals (stimuli) spreading over a connected grid, and local visual stimuli
are compared against the global image variation to model the visual
saliency. With this model, manifold learning is then applied to minimize the
local variation while keeping the global contrast, and turns the RGB image
into a multi channel image. After the projection through manifold learning,
histogram based contrast is then computed for saliency modeling of all
channels of the projected images, and mutual information is introduced to
evaluate each single channel saliency map against prior knowledge to provide
cues for the fusion of multiple channels. In the last step, the fusion
procedure combines all single channel saliency maps according to their mutual
information score, and generates the final saliency map. In our experiment,
the proposed method is evaluated using one of the largest publicly available
image datasets.
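
A rough sketch of the per-channel histogram-contrast saliency and the
mutual-information-weighted fusion described above follows. The
manifold-learning projection itself is omitted; `channels` is assumed to
already be the multi-channel projected image and `prior` a reference saliency
map, both hypothetical inputs.

# Histogram-contrast saliency per channel plus MI-weighted fusion (illustrative).
import numpy as np

def histogram_contrast_saliency(channel, n_bins=32):
    """Pixel saliency = contrast of its intensity bin against all other bins,
    weighted by how frequent those bins are (global histogram contrast)."""
    hist, edges = np.histogram(channel, bins=n_bins,
                               range=(channel.min(), channel.max() + 1e-9))
    prob = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    contrast = np.abs(centers[:, None] - centers[None, :]) @ prob
    bin_idx = np.clip(np.digitize(channel, edges) - 1, 0, n_bins - 1)
    sal = contrast[bin_idx]
    return (sal - sal.min()) / (sal.ptp() + 1e-9)

def mutual_information(a, b, n_bins=16):
    """MI between two maps, estimated from their joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=n_bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def fuse(channels, prior):
    maps = [histogram_contrast_saliency(c) for c in channels]
    weights = np.array([mutual_information(m, prior) for m in maps])
    weights = weights / (weights.sum() + 1e-9)
    return sum(w * m for w, m in zip(weights, maps))   # final saliency map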

--CONCLUSION--

Thus a robust manifold-based saliency estimation method has been proposed for
robotic vision to capture the most salient objects in the observed scenes. An
image is considered as a manifold of visual signals (stimuli) spreading over
a connected grid, and projected into a multi-channel format through manifold
learning. Histogram-based saliency estimation is then applied to extract the
saliency map for each single channel, respectively, and a fusion scheme based
on mutual information is introduced to combine all single-channel saliency
maps together according to their mutual information score.

--CONTRIBUTION--

The proposed method is evaluated using a well-known large image dataset. The
experimental results validate that the algorithm attains the best precision
and recall rates among several state-of-the-art saliency detection methods.

Furthermore, the proposed method has been demonstrated on a test video to
show its potential use for robotic applications, such as tracking a moving
target in an arbitrary scene while the camera is moving with a robot. The
experimental results show that the proposed approach can successfully
accomplish this sort of task, revealing its potential use for similar
robotic applications.

---

Abstract
Saliency detection has been a desirable way for robotic vision to find the
most noticeable objects in a scene. In this paper, a robust manifold based
saliency estimation method has been developed to help capture the most
salient objects in front of robotic eyes, namely cameras. In the proposed
approach, an image is considered as a manifold of visual signals (stimuli)
spreading over a connected grid, and local visual stimuli are compared
against the global image variation to model the visual saliency. With this
model, manifold learning is then applied to minimize the local variation
while keeping the global contrast, and turns the RGB image into a multi
channel image. After the projection through manifold learning, histogram
based contrast is then computed for saliency modeling of all channels of the
projected images, and mutual information is introduced to evaluate each
single channel saliency map against prior knowledge to provide cues for the
fusion of multiple channels. In the last step, the fusion procedure combines
all single channel saliency maps according to their mutual information score,
and generates the final saliency map. In our experiment, the proposed method
is evaluated using one of the largest publicly available image datasets. The
experimental results validated that our algorithm consistently outperforms
the state of the art unsupervised saliency detection methods, yielding higher
precision and better recall rates. Furthermore, the proposed method is tested
on a video where a moving camera is trying to catch up with a walking person,
a salient object in the video sequence. Our experimental results demonstrate
that the proposed approach can successfully accomplish this task, revealing
its potential use for similar robotic applications.


[pdf]

}}




      @inproceedings{borji-itti-12_object-based-top-down-attention,
       title={An Object-Based Bayesian Framework for Top-Down Visual Attention},
      author={Ali Borji and Dicky N. Sihite and Laurent Itti},
      booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
      year={2012},
      annote = {

      We introduce a new task-independent framework to model top-down overt visual
      attention based on graphical models for probabilistic inference and
      reasoning.  We describe a Dynamic Bayesian Network (DBN) that infers
      probability distributions over attended objects and spatial locations
      directly from observed data.  Probabilistic inference in our model is
      performed over object-related functions which are fed from manual annotations
      of objects in video scenes or by state-of-the-art object detection
      models. Evaluating over 3 hours (approx. 315,000 eye fixations and 12,600
      saccades) of observers playing 3 video games (time-scheduling, driving, and
      flight combat), we show that our approach is significantly more predictive of
      eye fixations compared to: 1) simpler classifier-based models also developed
      here that map a signature of a scene (multi-modal information from gist,
      bottom-up saliency, physical actions, and events) to eye positions, 2) 14
      state-of-the-art bottom-up saliency models, and 3) brute-force algorithms such
      as mean eye position. Our results show that the proposed model is more
      effective in employing and reasoning over spatio-temporal visual data.


      [pdf]

      }}

BRAIN-INTERFACE


@inproceedings{chung-cheung-11ijc_hierarchical-brain-computer-interface,
 title={A hierarchical architecture for adaptive brain-computer interfacing},
author={Mike Chung and Willy Cheung and Reinhold Scherer and Rajesh P. N. Rao},
booktitle={Proceedings of the 22nd IJCAI Volume Two},
pages={1647--1652},
year={2011},
abstract = {

---review 2013 rahul meena ~rahulme/cs365/hw2.html

Summary

Brain-computer interfaces (BCIs) provide a path for communication between
the brain and an external device. Most commonly, BCIs are used to control
devices such as cursors and robots using brain signals.

There are two types of BCI models:

- Non-invasive BCIs : e.g., those based on electroencephalographic (EEG)
	signals recorded from the scalp, suffer from low signal-to-noise
	ratio which limits the bandwidth of control.

- Invasive BCIs allow fine-grained control but can leave users exhausted
	since control is typically exerted on a moment-by-moment basis. In
	this paper, we address these problems by proposing a new adaptive
	hierarchical architecture for brain-computer interfacing. The
	approach allows a user to teach the BCI new skills on-the-fly; these
	learned skills are later invoked directly as high-level commands,
	relieving the user of tedious low-level control.

Proposes a new adaptive hierarchical architecture for BCI that can perform
	uncertainty-guided decisions in real-world BCI applications.

The hierarchical EEG-based BCI overcomes the limitations of non-invasive
BCIs. ... It lets the user teach the system new and useful tasks.

The low-level actions are first learned and are later executed
semi-autonomously via a higher-level command.

For example: left, right, stop are low-level commands and the command "Go to
kitchen" for a semi-autonomous mobile robot is a high level command. 

Once a command has been learned, these high-level commands relieve the user
of tedious low-level, moment-by-moment control. This also gives the robot the
ability to multi-task. The architecture uses Gaussian processes (GPs) for
learning high-level commands.

This hierarchical BCI system has three components:

1. SSVEP-based BCI,
2. Hierarchical Adaptive Menu System and 
3. Robot Application. 

The steady-state visual evoked potential (SSVEP)-based BCI is an EEG-based
BCI.

The hierarchical menu is the interface which the user uses to interact with
the hierarchical learning system. It displays the available commands for the
hierarchical learning system, which are selected using SSVEP-based BCI. The
selection for the command is made by focusing on the desired command in the
menu.

The main objective of this menu is to allow the user to choose between two
options: Train (where the system is taught a new command or an existing
command is modified) and Test (which allows the user to select a task that
was previously learned by the system). The logged data and learned models are
stored locally, and the user can update a selected skill with more
demonstrations when needed, improving performance over time. When an option
is selected via the BCI, its output is sent to the hierarchical menu system,
which sends a command to the robot and switches to the next menu. When using
GPs, if the uncertainty in a given region of task space is too high, the BCI
switches to user control for further guidance rather than continuing to
execute a potentially dangerous high-level command.
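
An illustrative sketch of this uncertainty-guided switching with a GP
regressor follows. It is not the paper's implementation; the numeric
state/command encoding and the threshold value are assumptions.

# GP skill model that hands control back to the user when uncertainty is high.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def train_skill(states, commands):
    """states: (n, d) observed states; commands: (n,) numerically encoded low-level commands."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(states, commands)
    return gp

def next_command(gp, state, uncertainty_threshold=0.5):
    mean, std = gp.predict(state.reshape(1, -1), return_std=True)
    if std[0] > uncertainty_threshold:
        return "ASK_USER"          # fall back to low-level BCI control
    return mean[0]                 # execute the learned skill autonomously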

Two experiments were conducted to study the proposed BCI. In the first
experiment, RBF networks were used for learning, and participants were
instructed to navigate a robot from its initial position to an assigned goal
position, first using low-level commands and then using high-level commands.
The results showed that performance with the hierarchical BCI was better than
with low-level commands alone. In the second experiment, GPs were used
instead of RBF networks for learning the commands, which also provided a way
to identify whether a learned command was safe to execute. The time freed up
after issuing a high-level command was used to give another command,
enabling the robot to multitask. The success rate was higher with the
GP-based BCI than with the RBF networks.

The paper confidently states that "Our results provide a proof-of-concept
demonstration that hierarchical BCIs may offer a scalable and robust approach
to controlling complex robotic devices in real-world environments. Our
ongoing efforts are focused on testing the approach with a larger number of
subjects and investigating its applicability to other challenging problems
such as controlling a robotic arm with grasping capabilities."

---

Brain-computer interfaces (BCIs) allow a user to directly control devices
such as cursors and robots using brain signals. 

We report results from four subjects who used a
hierarchical EEG-based BCI to successfully train and control a humanoid robot
in a virtual home environment. Gaussian processes were used for learning
high-level commands, allowing a BCI to switch between autonomous and
user-guided modes based on the current estimate of uncertainty. We also
report the first instance of multi-tasking in a BCI, involving simultaneous
control of two different devices by a single user. Our results suggest that
hierarchical BCIs can provide a flexible and robust way of controlling
complex robotic devices in real-world environments.

[pdf]

}}

LOGIC


  

@inproceedings{hinrichs-forbus-12_learning-qualitative-models-by-demonstration,
    title={Learning Qualitative Models by Demonstration},
    author={Hinrichs, T.R. and Forbus, K.D.},
    booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
    year={2012},
    annote =

---review 2013 gurpreet  ~gpskalra/cs365/hw2/paper_summary.pdf

Based on Qualitative Process Theory [2], which is useful for giving a
qualitative representation of quantities in terms of inequalities. They
define a qualitative model as a directed acyclic graph of influences between
quantities, conditioned by active processes that drive change [3]. They then
use this model to guide the behavior of a player (playing Freeciv) through
demonstration.

Their work and objectives are summarized below:

 *  Decomposing goals into subgoals and finding the shortest path from a legal
	action to the goal using back propagation. [1]
 *  During the demonstration of the game, learn the user's motivation in
	taking an action, the causes behind game events, and the quantity
	changes that happen due to actions within a turn and across turns, and
	thus construct a qualitative model. [3]
 *  Learn direct and indirect qualitative influences using Qualitative
	Process Theory and microtheory inheritance. [3]
 *  Learn to reach legal actions using a hierarchical task network. [3]
 *  Learning quantitative relations through three subtasks: identifying the
	independent variable, identifying the dependent variable, and
	incrementally inducing the function. [3]

Creating software agents that learn interactively requires the ability to
learn from a small number of trials, extracting general, flexible knowledge
that can drive behavior from observation and interaction. We claim that
qualitative models provide a useful intermediate level of causal
representation for dynamic domains, including the formulation of strategies
and tactics. We argue that qualitative models are quickly learnable, and
enable model based reasoning techniques to be used to recognize,
operationalize, and construct more strategic knowledge.  This paper describes
an approach to incrementally learning qualitative influences by demonstration
in the context of a strategy game. We show how the learned model can help a
system play by enabling it to explain which actions could contribute to
maximizing a quantitative goal. We also show how reasoning about the model
allows it to reformulate a learning problem to address delayed effects and
credit assignment, such that it can improve its performance on more strategic
tasks such as city placement


[pdf]

}}



@inproceedings{benferhat-hue-11ijc_interval-based-possibilistic-logic,
 title={Interval-based possibilistic logic},
  author={Salem Benferhat and Julien  Hu{\'e} and Sylvain Lagrue and Julien Rossit},
  booktitle={Proceedings of the 22nd IJCAI Volume Two},
  pages={750--755},
  year={2011},
  abstract = {

--2013 Review Ashudeep Singh ~ashudeep/cs365/hw2.html
BETTER: ~ashudeep/cs365/hw2/hue_ijcai_review.pdf

Motivation

Possibilistic logic is a general framework used to deal with logical
reasoning over inconsistent knowledge bases [2]. The major concern with
representing formulas in possibilistic logic (a pair of a propositional logic
formula and a positive real degree in [0,1]) is that it is hard to associate
exact degrees with formulas of a knowledge base, due to uncertainty. This
paper aims at extending possibilistic logic so that expressions are
represented as a propositional logic formula together with a measure of the
imprecision regarding its uncertainty. The extension must also keep the time
complexity of the reasoning process in mind.

Possibilistic Logic - Overview

Terminology [2]

 L: a finite propositional language.
 Ω: the finite set of interpretations of L.
 ω: an element of Ω.
 π: a possibility distribution function from Ω to [0,1].
 π(ω): the degree of compatibility of interpretation ω with the knowledge available.
 Π(φ) = max{π(ω) : ω ⊨ φ}: measures the extent to which the formula φ is compatible with the available knowledge.
 N(φ) = 1 - Π(¬φ): measures the extent to which φ is entailed by the available knowledge.
 K = {(φ_i, a_i) : i = 1..n}: the possibilistic base (the set of possibilistic formulas (φ_i, a_i)).
 K_{>a}: the strict a-cut, containing the φ_i with (φ_i, a_i) in K and a_i > a.
 Inc(K): the inconsistency degree of K, i.e. the largest weight a such that the a-cut of K is inconsistent (0 if K is consistent).

Interpretations [3]

π(ω) = 1 means ω is fully consistent with the available knowledge, while π(ω) = 0 means ω is impossible.
π(ω) > π(ω') means ω is more compatible than ω'.

φ is a (plausible) consequence of K iff Π(φ) > Π(¬φ), i.e. the best models of φ are more preferred than the best models of ¬φ.
Given a possibilistic base K, we can generate a unique possibility distribution in which interpretations satisfying all propositional formulas have degree 1, while the others are pre-ordered w.r.t. the highest-weighted formulas they falsify.
Soundness and completeness of possibilistic logic: a weighted formula is derivable from K by possibilistic inference iff it is entailed by the possibility distribution induced by K.
Interval-based possibility distribution

An interval-based possibilistic logic expression is of the form (φ, I), where I is a subinterval of [0,1].
Given the above definitions for standard possibilistic logic, we need to redefine them for interval-based possibilistic knowledge. We start by defining some operations on intervals:

Max of intervals: given a set of intervals, take the max of the lower bounds and the max of the upper bounds.
Reverse of an interval: the interval obtained by applying x -> 1 - x to both bounds (and swapping them).
Comparison: intervals are compared through their bounds; if (φ, I1) and (ψ, I2) are in the knowledge base and I1 dominates I2, then φ is strictly preferred to ψ.

Using the above operations on intervals, we redefine the terminology in the interval-based possibilistic logic framework.

Interval-based possibilistic base: an interval-based possibilistic base, denoted IK, is a multi-set of formulas associated with intervals: IK = {(φ_i, I_i) : i = 1..n}.

Compatible possibilistic base: a possibilistic base K is said to be compatible with an interval-based possibilistic base IK iff there exists a bijection f from IK to K such that each (φ_i, I_i) in IK is mapped to some (φ_i, a_i) in K with a_i in I_i.
C(IK) is the (infinite) set of all compatible possibilistic bases associated with IK.
K_L and K_U are the possibilistic bases obtained from the lower and upper bounds of the intervals, respectively.
Interval-based possibility distribution: each interpretation ω is assigned the interval bounded by the weights it receives w.r.t. K_L and K_U.

Interval-based consequence is then defined through the compatible bases: a formula follows from IK with a given interval iff the corresponding standard possibilistic consequences hold for the bounding compatible bases. The paper also proves that the new framework is sound as well as complete with respect to this semantics.
Computational Complexity

The decision problem of whether a given formula is a consequence of a standard possibilistic base has a known completeness result [6]. The paper shows that deciding whether a given formula is a consequence of an interval-based possibilistic base is complete for the same complexity class, and presents an algorithm that solves the problem using calls to a SAT (propositional satisfiability) solver.

Conclusion

The paper proposes a new possibilistic logic framework that addresses a limitation of standard possibilistic logic: the need to assign precise single-valued degrees to propositions. It extends the main definitions of standard possibilistic logic to interval-based possibilistic logic. Reasoning in the new framework is computationally as efficient as in the standard framework, while being more expressive in applications. Future work is to study the fusion and revision processes in the interval-based possibilistic logic framework.
Bibliography

[1] Benferhat, S.; Hué, J.; Lagrue, S.; Rossit, J. Interval-based possibilistic logic. IJCAI'11.
[2] S. Benferhat and H. Prade. Compiling possibilistic knowledge bases. In Proc. of ECAI'06, pages 337-341, 2006.
[3] D. Dubois and H. Prade. Possibilistic logic: a retrospective and prospective view. Fuzzy Sets and Systems, 144:3-23, 2004.
[4] D. Dubois, J. Lang, and H. Prade. Timed possibilistic logic. Fundamenta Informaticae, XV:211-234, 1992.
[5] D. Dubois, J. Lang, and H. Prade. Possibilistic logic. In Handbook of Logic in Artificial Intelligence and Logic Programming, volume 3, pages 439-513, 1994.
[6] B. Nebel. Base revision operations and schemes: semantics, representation and complexity. In Proc. of ECAI'94, pages 341-345, 1994.

---ABSTRACT

Possibilistic logic is a well-known framework for dealing with uncertainty
and reasoning under inconsistent knowledge bases. Standard possibilistic
logic expressions are propositional logic formulas associated with positive
real degrees belonging to [0,1]. However, in practice it may be difficult for
an expert to provide exact degrees associated with formulas of a knowledge
base. This paper proposes a flexible representation of uncertain information
where the weights associated with formulas are in the form of intervals. We
first study a framework for reasoning with interval-based possibilistic
knowledge bases by extending main concepts of possibilistic logic such as the
ones of necessity and possibility measures. We then provide a
characterization of an interval-based possibilistic logic base by means of a
concept of compatible standard possibilistic logic bases. We show that
interval-based possibilistic logic extends possibilistic logic in the case
where all intervals are singletons. Lastly, we provide computational
complexity results of deriving plausible conclusions from interval-based
possibilistic bases and we show that the flexibility in representing
uncertain information is handled without extra computational costs.

[pdf]

benferhat-hue-11ijc_ashudeep-review

}}

GAME PLAY




@article{barker-korf-12_solving-peg-solitaire,
 title={Solving Peg Solitaire with Bidirectional BFIDA},
  author={Barker, J.K. and Korf, R.E.},
  year={2012},
  abstract = {

A pagoda function assigns values to the holes such that any three holes in a
straight line with values x, y, z satisfy the constraints x <= y + z and
z <= y + x. This reflects the fact that a peg beginning at x or z can jump to
z or x respectively, so the total pagoda value can only decrease over time;
reaching a position whose pagoda value is smaller than the goal's therefore
means it is a dead end. This can be used to speed up search.
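
A generic sketch of the pagoda check and the resulting pruning bound follows;
the hole indexing and board representation are assumptions.

# Pagoda-function check and pruning bound for peg solitaire (illustrative).
def is_pagoda(values, lines):
    """values: dict hole -> real value; lines: all triples (x, y, z) of
    consecutive holes in a straight line. A valid pagoda function never
    increases under any jump in either direction along a line."""
    return all(values[x] + values[y] >= values[z] and
               values[z] + values[y] >= values[x]
               for (x, y, z) in lines)

def prune(board_pegs, goal_pegs, values):
    """True if this position is a dead end: the pagoda value can only
    decrease, so it can never climb back up to the goal's value."""
    current = sum(values[h] for h in board_pegs)
    goal = sum(values[h] for h in goal_pegs)
    return current < goal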

Peg type = parity of row and column [four types]. A peg's type never changes
over the game, and a peg can jump over certain types only.

The paper uses this kind of analysis to represent a peg solitaire game and
solve it using an efficient bidirectional breadth-first iterative-deepening
A* (BFIDA*) search.

--- review 2013 Lovish ~lovishc/cs365/hw2/report.pdf

The work improves on previous solutions to the problem in three ways:

* It improves the heuristic previously used to solve the peg solitaire
  problem.

* It makes detection of duplicate states easier.

* It decreases the time taken by the previous state-of-the-art solver in the
  last iteration, and thus significantly decreases the overall time taken (by
  about two orders of magnitude), since the last iteration takes the most
  time.

Introduction

Peg solitaire, known as Brainvita in India, is a single-player game in which
the objective is to change the state of the board from an initial state B1 to
a goal state B2 through consecutive jumps of pegs, where each "jumped over"
peg is removed from the board. It is played on a variety of boards, some of
which are depicted in Figure 1.

The most common start and end states are depicted in Figure 2: at the start
only the central hole is empty, and at the end only the central hole holds a
peg.

Contributions

* Improved heuristics

Heuristics must be admissible and consistent to find an optimal solution. In
this problem we are trying to find the least number of moves (consecutive
jumps by the same peg count as one move). The heuristics used in the paper
are:

Figure 1: French, English, Diamond and Triangle boards

Figure 2: Start and end states

1. Number of pegs at corners (hc(n))

(The rest of the analysis is for the cross-shaped English board: two 7x3
rectangles overlapping in a central 3x3 square, 33 holes in total.)

The pegs at corners cannot be jumped over, so a move must be initiated at
these pegs; hence at least as many moves as there are corner pegs must be
made to reach the final state. This heuristic never overestimates the number
of moves, which makes it admissible. To obtain a tighter bound, the authors
define two more heuristics.

2. Maximum moves required to remove excess pegs of each type (ht(n))

  - a. 12 blue: the corner pegs, plus the concave corners.
  - b. 5 red: the center of each sub-square; since one peg must be left at
	the center, the red pegs can never be removed completely.
  - c. 8 green: centers of the rows in the squares (forming 3 dotted columns).
  - d. 8: centers of the columns in the squares (these form a row).

To keep this heuristic independent of the former one, the authors ignore the
pegs that are captured (jumped over) by the moves starting at corners.

3. Number of occupied Merson regions (hm(n))

A Merson region is a subsection of the board which, when completely filled
with pegs, allows no peg to be removed by a move that does not originate
inside that region. The authors count a static set of Merson regions,
excluding the corners, which are already covered by the first heuristic.

The final heuristic the authors use is:
h(n) = hc(n) + max(ht(n), hm(n))

* Detection of Duplicate States

This is quite a challenge, since rotational variants are possible. [mirror
variants?]

Barker and Korf arrange the search so that each tree level contains positions
with the same number of jumps. Duplicates can thus be detected at that
particular level and pruned to avoid redundant computation.

* Avoiding the last iteration

Instead of using traditional BFIDA* (breadth-first iterative-deepening A*),
the authors use a bidirectional-search [2] variant of it: the cutoff is
increased and the direction, forward or backward, is chosen based on which
direction passed the fewer nodes on from the previous iteration. The
intersection of the two searches gives a candidate optimal solution.

This avoids the time spent by unidirectional BFIDA* in the last iteration,
which reduces the overall time to solve the board.
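
A rough sketch of the direction-selection idea (illustrative only: it leaves
out the A* cost bounds, layered duplicate detection and solution
reconstruction that the real solver needs):

def bidirectional_search(expand_forward, expand_backward, start_state,
                         goal_state, max_iterations=64):
    """Grow forward and backward frontiers, always extending the side that is
    currently cheaper, and stop when the frontiers intersect."""
    fwd, bwd = {start_state}, {goal_state}
    for _ in range(max_iterations):
        meet = fwd & bwd
        if meet:                       # frontiers intersect:
            return meet                # candidate optimal meeting states
        if len(fwd) <= len(bwd):       # grow the smaller (cheaper) side
            fwd = {s for state in fwd for s in expand_forward(state)}
        else:
            bwd = {s for state in bwd for s in expand_backward(state)}
    return set()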

References
[1] Bell, G. I. 2007. "Diagonal peg solitaire". Integers: Electronic Journal of Combinatorial Number Theory 7(G1):20.
[2] Pohl, I. 1971. "Bi-directional search". Machine Intelligence 6:127-140.
[3] Peg Solitaire, Wikipedia, https://en.wikipedia.org/wiki/Peg_solitaire
[4] Jürgen Köller on Peg Solitaire, http://www.mathematische-basteleien.de/solitaire.htm
[5] Barker, Joseph K., and Richard E. Korf. "Solving Peg Solitaire with Bidirectional BFIDA." (2012).

Figure 1 has been taken from Wikipedia, Figures 2 and 4 from [4], and Figure 3 from [5] itself.

---

[pdf]

}}

VISION: OBJECT RECOGNITION




@article{hinton-srivastava-12_NN-prevents-co-adaptation-of-feature-detectors,
  title={Improving neural networks by preventing co-adaptation of feature detectors},
  author={Hinton, G.E. and Srivastava, N. and Krizhevsky, A. and Sutskever, I. and Salakhutdinov, R.R.},
  journal={arXiv preprint arXiv:1207.0580},
  year={2012},
  annote = {

imagenet = more than 1000 object classes, organized in a hierarchy based on
wordnet, with 1000s of images in most classes. 

RESULT: 
winning entry in imagenet challenge 2010 was based on deep-NN with error rate
of 47.2% on the test set. The current state-of-the-art result on this dataset
is 45.7%. We achieved comparable performance of 48.6% error using a single
neural network with five convolutional hidden layers interleaved with
“max-pooling” layer followed by two globally connected layers and a final
1000-way softmax layer. All layers had L2 weight constraints on the incoming
weights of each hidden unit. Using 50% dropout in the sixth hidden layer
reduces this to a record 42.4%. 

---

When a large feedforward neural network is trained on a small training set,
it typically performs poorly on held-out test data. This “overfitting” is
greatly reduced by randomly omitting half of the feature detectors on each
training case. This prevents complex co-adaptations in which a feature
detector is only helpful in the context of several other specific feature
detectors. Instead, each neuron learns to detect a feature that is generally
helpful for producing the correct answer given the combinatorially large
variety of internal contexts in which it must operate. Random “dropout” gives
big improvements on many benchmark tasks and sets new records for speech and
object recognition.
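
A minimal numpy sketch of the dropout idea described above (an illustration,
not the authors' implementation): each hidden unit is zeroed with probability
0.5 during training, and at test time all units are kept with activations
scaled by 0.5, which is equivalent to the paper's halving of the outgoing
weights:

import numpy as np

def dropout_forward(activations, p_drop=0.5, train=True, rng=np.random):
    """Randomly omit hidden units during training ("dropout")."""
    if train:
        mask = rng.rand(*activations.shape) >= p_drop   # 1 = keep, 0 = drop
        return activations * mask
    # test time: keep everything, scale to match the expected training output
    return activations * (1.0 - p_drop)

# Example: a layer of 8 hidden activations for a batch of 2 cases.
h = np.random.randn(2, 8)
h_train = dropout_forward(h, train=True)    # about half the units are zeroed
h_test  = dropout_forward(h, train=False)   # all units kept, scaled by 0.5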

	 
[pdf]

}}



@article{krizhevsky-hinton-12-imagenet-convolutional-NN-deep,
 title={ImageNet classification with deep convolutional neural networks},
  author={Krizhevsky, A. and Sutskever, I. and Hinton, G.},
  journal={Advances in Neural Information Processing Systems},
  volume={25},
  year={2012}

winner: 2012 imagenet challenge:

SuperVision	Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton

---

Aim and Methodology:

The main aim of this paper is to develop a classifier based on Convolutional
Neural Networks (CNNs) which can recognize the object in an image taken from
the ImageNet dataset.  ImageNet is a huge dataset containing over 15 million
well-labelled images. Tasks involving simple object recognition can be done
with good accuracy on smaller datasets as well, but object recognition in
real-life situations is more complex and needs large training data. Even this
amount of training data is not sufficient to get accurate results, but CNNs
(involving 2D convolution) paired with GPUs can be used to achieve the
goal. The authors have utilised the property of CNNs of making good and
largely correct assumptions about the nature of images. Their network
consisted of 5 convolutional and 3 fully connected layers (shown in Figure 1
below).

This depth of network was obtained after many experiments by the authors, and
any change in the depth of the network resulted in an increase in error. They
used about 1.2 million images for training, which, owing to the size of the
network, resulted in over-fitting.

So they expanded their dataset by taking every possible image block of size
224x224 out of each 256x256 image and also taking their horizontal
reflections, which increased the amount of training data by a factor of 2048
without any additional computational cost: while the GPU was training on one
batch of images, the transformed images for the next batch were generated in
parallel on the CPU. They used a non-saturating, nonlinear function to model
the output of each neuron, f(x) = max(0, x). On the basis of their
experiments, this nonlinearity requires less time to train the model than the
standard sigmoid function. Due to the large size of the training data, a
single GPU could not handle the task, as GPUs at the time had only 3 GB of
memory, which is insufficient, and hence two GPUs were used. The neuron
responses were normalized using local response normalization,
b^i_{x,y} = a^i_{x,y} / (k + α Σ_j (a^j_{x,y})^2)^β, where the sum runs over
n adjacent kernel maps at the same spatial position and k, n, α and β are
hyper-parameters. The weights w were updated with stochastic gradient descent
using momentum (0.9) and weight decay (0.0005).

Figure 1: Architecture of the  network
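
The crop-and-reflection augmentation described above can be sketched in a few
lines of numpy (illustrative: a patch is sampled at random here, whereas the
factor of 2048 comes from enumerating all offsets and both reflections):

import numpy as np

def random_crop_and_flip(image, crop=224, rng=np.random):
    """Sample one crop x crop patch, possibly mirrored, from an H x W x 3
    image (assumed 256 x 256 x 3 as in the paper)."""
    h, w, _ = image.shape
    top  = rng.randint(0, h - crop + 1)
    left = rng.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop, :]
    if rng.rand() < 0.5:                 # horizontal reflection
        patch = patch[:, ::-1, :]
    return patch

# Example on a synthetic training image:
img = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)
batch = np.stack([random_crop_and_flip(img) for _ in range(4)])
print(batch.shape)   # (4, 224, 224, 3)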

Conclusion: This network achieved the best top-1 and top-5 error rates of
37.5% and 17.0%. Thus we can see that the deep CNN achieved record-breaking
results. The network is capable of recognising objects that are not
necessarily located at the centre of the image, as can be seen in the figure,
though there were some errors as well. Unsupervised pre-training was not
used, although it could have helped. This CNN classifier can further be used
for similar tasks on videos as well.

Note: Every sentence, image, equation, data item, etc. so marked has been
taken from the original paper of which this is a review.

---review 2013 Triya Bhattacharya

The intended goal of the experiments was to create a deep convolutional
network that uses supervised learning to achieve better (lower) error rates
than previously observed for identifying images on a highly challenging
dataset.

The parameters used for judging whether the CNN recognises the object are the
“Top-1” and “Top-5” predictions – that is, whether the correct label matches
the top prediction, or any of the top five predictions. ILSVRC-2010 and
ILSVRC-2012 images were used for training, with ILSVRC-2010 as the test set;
these are subsets of ImageNet, which contains roughly 15 million
high-resolution images in about 22,000 categories. Given a rectangular image,
they first rescaled the image so that the shorter side was of length 256, and
then cropped out the central 256x256 patch from the resulting image.

The main feature of the architecture of the CNN is its layers – there are 5
convolutional layers and 3 fully connected layers – and this depth is what
helps to reduce the error rates, as the removal of even a single layer
degrades the performance by about 2%.  The figure shows the test results for
the previously used models and for the new CNN model. The network’s input is
150K-dimensional, and the number of neurons in the network’s remaining layers
is given by 253K–186K–65K–65K–43K–4096–4096–1000.

The main features of their model included the usage of

- ReLU nonlinearity (Rectified Linear Units) instead of the usual saturating
	nonlinearities (hyperbolic tangent or sigmoid functions), which
	accelerated the ability to fit the training set and allowed faster
	learning.

- Training on multiple GPUs, one of which ends up dealing with images in a
	largely colour-agnostic manner and the other in a colour-specific
	manner. This increased the available memory and hence the size of
	network that could be trained. Half of the kernels (neurons) are on
	one GPU and the other half on the other, with the GPUs communicating
	only in certain layers. This allows the amount of communication to be
	tuned precisely until it is an acceptable fraction of the amount of
	computation. For example, the kernels of layer 3 take input from all
	kernel maps in layer 2, whereas kernels in layer 4 take input only
	from those kernel maps in layer 3 which reside on the same GPU. The
	two-GPU net also takes less time to train than a one-GPU net.

- Other than this, the local response normalization aids generalization even
	though ReLUs don't require it. This sort of response normalization
	implements a form of lateral inhibition inspired by the type found in
	real neurons, creating competition for big activities amongst neuron
	outputs computed using different kernels.

- Also, overlapping pooling: pooling layers in CNNs summarize the outputs
	of neighboring groups of neurons in the same kernel map (a pooling
	layer can be thought of as a grid of pooling units spaced s pixels
	apart, each summarizing a neighborhood of size z x z centered at the
	location of the pooling unit). They observe during training that
	models with overlapping pooling find it slightly more difficult to
	overfit; see the sketch after this list.

- The architecture is such that the kernels of the second, fourth, and fifth
	convolutional layers are connected only to those kernel maps in the
	previous layer which reside on the same GPU. The kernels of the third
	convolutional layer are connected to all kernel maps in the second
	layer. The neurons in the fully-connected layers are connected to all
	neurons in the previous layer. Response-normalization layers follow
	the first and second convolutional layers. Max-pooling layers follow
	both response-normalization layers as well as the fifth convolutional
	layer. The ReLU non-linearity is applied to the output of every
	convolutional and fully-connected layer.
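
As referenced in the pooling bullet above, a small numpy sketch of
overlapping max pooling with window z = 3 and stride s = 2 (the values used
in the paper; the loop-based implementation is just for illustration):

import numpy as np

def max_pool2d(feature_map, z=3, s=2):
    """Overlapping max pooling: z x z windows spaced s pixels apart; with
    s < z neighbouring windows overlap."""
    h, w = feature_map.shape
    out_h = (h - z) // s + 1
    out_w = (w - z) // s + 1
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feature_map[i * s:i * s + z, j * s:j * s + z].max()
    return out

# Example on a single 8x8 feature map:
fm = np.arange(64, dtype=float).reshape(8, 8)
print(max_pool2d(fm).shape)   # (3, 3)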

Overfitting is reduced by augmenting the data to artificially enlarge the
dataset using label-preserving transformations, either by generating image
translations and horizontal reflections, or by performing PCA on the set of
RGB pixel values over all images. Another method that reduces overfitting is
dropout, which consists of setting to zero the output of each hidden neuron
with probability 0.5. The neurons which are “dropped out” in this way do not
contribute to the forward pass and do not participate in backpropagation, so
every time an input is presented the neural network samples a different
architecture, but all these architectures share weights. As for the learning
rate, they used an equal learning rate for all layers and divided the
learning rate by 10 when the validation error rate stopped improving at the
current rate. These ideas have all contributed to the significant reduction
in the top-1 and top-5 error rates.  The network depends on all of its
layers, so depth is important for performance, which may be improved upon in
coming versions. No unsupervised pre-training was used, which could be
implemented to further decrease the error rates. The results also depend on
the available computational power: they were achieved by making the network
larger and training it longer.
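
A tiny sketch of the learning-rate heuristic just mentioned (the patience
parameter is an assumption added for illustration; the paper only states the
divide-by-10 rule):

class DivideOnPlateau:
    """Divide the learning rate by 10 when the validation error stops
    improving for `patience` consecutive epochs (patience is assumed)."""

    def __init__(self, lr=0.01, factor=10.0, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_error):
        if val_error < self.best:
            self.best, self.bad_epochs = val_error, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr

# Usage: lr = scheduler.step(current_validation_error) after each epoch.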

Ultimately, very large and deep convolutional nets on video sequences can be
used, where the temporal structure provides very helpful information that is
missing or far less obvious in static images.

---

Our model is a large, deep convolutional neural network
trained on raw RGB pixel values. The neural network, which has 60 million
parameters and 650,000 neurons, consists of five convolutional layers, some
of which are followed by max-pooling layers, and three globally-connected
layers with a final 1000-way softmax. It was trained on two NVIDIA GPUs for
about a week. To make training faster, we used non-saturating neurons and a
very efficient GPU implementation of convolutional nets. To reduce
overfitting in the globally-connected layers we employed hidden-unit
"dropout", a recently-developed regularization method that proved to be very
effective.

}}



@inproceedings{geiger-lauer-11cvpr_generative-3D-urban-scene-understanding,
  title={A generative model for 3d urban scene understanding from movable platforms},
  author={Geiger, A. and Lauer, M. and Urtasun, R.},
  booktitle={Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on},
  pages={1945--1952},
  year={2011},
  organization={IEEE}

[see their followup work in CVPR12 : geiger-lenz-12_autonomous-driving]

Traffic scene modeling - given a sequence of images taken from a movable
platform, infers traffic flow and identifies streets, the number of streets
at a junction, traffic behaviour, and also scene features such as buildings,
etc.

basic inferencing based on extension of metropolis-hastings algo called
Reversible Jump-Markov chain Monte Carlo (MCMC).  

main idea: enables comparisons of different probabilistic models w different
   topologies (and diff density distributions) by constructing a bijection
   (jump) between them via addl states.  Reversible Jump = RJ-MCMC
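
For reference, a minimal Metropolis-Hastings sketch in Python (the building
block that RJ-MCMC extends with cross-model jumps); the Gaussian random-walk
proposal and the toy target are illustrative, not the paper's scene model:

import numpy as np

def metropolis_hastings(log_target, x0, n_steps=10000, step=0.5,
                        rng=np.random):
    """Plain Metropolis-Hastings with a symmetric Gaussian random-walk
    proposal.  RJ-MCMC generalizes the acceptance ratio so that proposals can
    also jump between models of different dimensionality."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = x + step * rng.randn(*x.shape)
        log_alpha = log_target(proposal) - log_target(x)
        if np.log(rng.rand()) < log_alpha:     # accept with prob min(1, alpha)
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

# Example: sampling a 2-D standard normal.
samples = metropolis_hastings(lambda v: -0.5 * np.sum(v ** 2), x0=[0.0, 0.0])
print(samples.mean(axis=0))   # close to [0, 0]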


---13 365 Anant Raj:

The presented model is generative, so sampling can be done from it. The
described approach is very general and takes into account both dynamic and
static features, so by putting constraints on the dynamic features the static
scene can be analysed, and vice versa. They describe the results of a
comparison between GP regression and the presented approach [3]. This
approach can also be used to improve object detection and semantic scene
segmentation results because of its geometric and topological constraints: a
car is more likely to be on a road than inside a home or indoors. The same
goes for the scene as a whole.

[pdf]

}}



@inproceedings{parkhi-vedaldi-zisserman-jawahar-12_cats-and-dogs,
title={Cats and dogs},
author={Parkhi, O.M. and Vedaldi, A. and Zisserman, A. and Jawahar, CV},
booktitle={Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on},
pages={3498--3505},
year={2012},
annote = {

We investigate the fine grained object categorization problem of determining
the breed of animal from an image.  To this end we introduce a new annotated
dataset of pets, the Oxford-IIIT-Pet dataset, covering 37 different breeds of
cats and dogs. The visual problem is very challenging as these animals,
particularly cats, are very deformable and there can be quite subtle
differences between the breeds.

We make a number of contributions: first, we introduce a model to classify a
pet breed automatically from an image.  The model combines shape, captured by
a deformable part model detecting the pet face, and appearance, captured by a
bag-of-words model that describes the pet fur. Fitting the model involves
automatically segmenting the animal in the image. Second, we compare two
classification approaches: a hierarchical one, in which a pet is first
assigned to the cat or dog family and then to a breed, and a flat one, in
which the breed is obtained directly. We also investigate a number of animal
and image orientated spatial layouts.

These models are very good: they beat all previously published results on the
challenging ASIRRA test (cat vs dog discrimination). When applied to the task
of discriminating the 37 different breeds of pets, the models obtain an
average accuracy of about 59%, a very encouraging result considering the
difficulty of the problem.


[pdf]

}}



@article{vezhnevets-ferrari-12_weakly-supervised-semantic-segmentation,
title={Weakly Supervised Structured Output Learning for Semantic Segmentation},
author={Vezhnevets, A. and Ferrari, V. and Buhmann, J.M.},
booktitle={Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on},
year={2012},
annote = {

[semantic segmentation = assign a label for each pixel:
(e.g. "dog", "car" or "road").]

standard approach: fully supervised (every pixel is
manually labeled) [1, 2, 3, 4, 5] --> high annotation cost.
weakly supervised methods: [6,7]

tests on label-me dataset. comparisons only within own algorithms. ]

ABSTRACT
We address the problem of weakly supervised semantic segmentation. The
training images are labeled only by the classes they contain, not by their
location in the image. On test images instead, the method must predict a
class label for every pixel. Our goal is to enable segmentation algorithms to
use multiple visual cues in this weakly supervised setting, analogous to what
is achieved by fully supervised methods. However, it is difficult to assess
the relative usefulness of different visual cues from weakly supervised
training data. We define a parametric family of structured models, where each
model weighs visual cues in a different way. We propose a Maximum Expected
Agreement model selection principle that evaluates the quality of a model
from the family without looking at superpixel labels. Searching for the best
model is a hard optimization problem, which has no analytic gradient and
multiple local optima. We cast it as a Bayesian optimization problem and
propose an algorithm based on Gaussian processes to efficiently solve it. Our
second contribution is an Extremely Randomized Hashing Forest that represents
diverse superpixel features as a sparse binary vector. It enables using
appearance models of visual classes that are fast at training and testing and
yet accurate.  Experiments on the SIFT-flow data-set show a significant
improvement over previous weakly supervised methods and even over some fully
supervised methods.

We formulate semantic segmentation as
a pairwise CRF, as in other works [1, 7, 5, 6]. The task of
the model is to infer latent superpixel labels in training images
and learn appearance models of classes.

Our second contribution is an improved representation
of the appearance models of semantic classes:
Extremely Randomized Hashing Forest (ERHF).
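
A rough sketch of the hashing-forest idea, i.e. trees of random splits that
map a feature vector to a sparse binary code (an illustration of the general
idea only, not the authors' ERHF):

import numpy as np

def build_random_tree(dim, depth, rng):
    """Each internal node is a random (feature index, threshold) split."""
    return [(rng.randint(dim), rng.uniform(-1.0, 1.0))
            for _ in range(2 ** depth - 1)]

def tree_leaf(tree, x, depth):
    node = 0                                   # root of a heap-ordered tree
    for _ in range(depth):
        feat, thr = tree[node]
        node = 2 * node + (1 if x[feat] > thr else 2)
    return node - (2 ** depth - 1)             # leaf index in [0, 2**depth)

def hash_forest(x, trees, depth):
    """Concatenate one-hot leaf indicators from every tree: a sparse binary
    vector with exactly len(trees) non-zeros."""
    code = np.zeros(len(trees) * 2 ** depth)
    for t, tree in enumerate(trees):
        code[t * 2 ** depth + tree_leaf(tree, x, depth)] = 1.0
    return code

rng = np.random.RandomState(0)
depth, dim = 3, 16
trees = [build_random_tree(dim, depth, rng) for _ in range(10)]
print(hash_forest(rng.randn(dim), trees, depth).sum())   # 10.0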


[pdf]

}}



@inproceedings{kavukcuoglu-ranzato-lecun-09cvpr_learning-invariant-features,
 title={Learning invariant features through topographic filter maps},
  author={Kavukcuoglu, K. and Ranzato, M.A. and Fergus, R. and Le-Cun, Y.},
  booktitle={Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on},
  pages={1605--1612},
  year={2009},
  annote = {

one of the strongest classifiers for MNIST.

ABSTRACT:
Several recently-proposed architectures for high-performance object
recognition are composed of two main stages: a feature extraction stage that
extracts locally-invariant feature vectors from regularly spaced image
patches, and a somewhat generic supervised classifier.  The first stage is
often composed of three main modules: (1) a bank of filters (often oriented
edge detectors); (2) a non-linear transform, such as a point-wise squashing
functions, quantization, or normalization; (3) a spatial pooling operation
which combines the outputs of similar filters over neighboring regions.

We propose a method that automatically learns such feature extractors in an
unsupervised fashion by simultaneously learning the filters and the pooling
units that combine multiple filter outputs together. The method automatically
generates topographic maps of similar filters that extract features of
orientations, scales, and positions. These similar filters are pooled
together, producing locally-invariant outputs. The learned feature
descriptors give comparable results as SIFT on image recognition tasks for
which SIFT is well suited, and better results than SIFT on tasks for which
SIFT is less well suited.
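
A toy numpy/scipy sketch of the generic two-stage front end described in the
abstract (filter bank, point-wise squashing, spatial pooling); the two
oriented filters are hand-picked for illustration, whereas IPSD learns both
the filters and the pooling topography:

import numpy as np
from scipy.signal import convolve2d

def feature_extractor(image, filters, pool=4):
    """Filter bank -> point-wise non-linearity -> spatial max pooling over
    non-overlapping pool x pool neighbourhoods."""
    maps = []
    for f in filters:
        response = convolve2d(image, f, mode="valid")     # (1) filter bank
        response = np.abs(np.tanh(response))              # (2) squashing
        h, w = response.shape
        h, w = h - h % pool, w - w % pool
        pooled = response[:h, :w].reshape(h // pool, pool,
                                          w // pool, pool).max(axis=(1, 3))
        maps.append(pooled)                                # (3) pooling
    return np.stack(maps)

# Two illustrative oriented-edge filters (horizontal and vertical).
filters = [np.array([[1.0, 1.0], [-1.0, -1.0]]),
           np.array([[1.0, -1.0], [1.0, -1.0]])]
img = np.random.rand(32, 32)
print(feature_extractor(img, filters).shape)   # (2, 7, 7)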

---

It is well established that sparse coding algorithms applied to natural
images learn basis functions that are localized oriented edges and resemble
the receptive fields of simple cells in area V1 of the mammalian visual
cortex [19]. These methods produce feature representation that are sparse,
but not invariant. If the input pattern is slightly distorted, the
representation may change drastically. Moreover, these features represent
information about local texture, and hence, are rather inefficient when used
to preprocess whole images because they do not exploit the redundancy in
adjacent image patches. Finally, most sparse coding algorithms [19, 14, 17,
24, 4] have found limited applications in vision due to the high
computational cost of the iterative optimization required to compute the
feature descriptor.

In this paper, we introduce an algorithm, named Invariant
Predictive Sparse Decomposition (IPSD), that: (1)
learns features that are invariant to small variations inherent
in the data, and (2) produces more efficient representations
because they can be compact and directly computed using
a feed-forward function, without requiring the use of any iterative
optimization procedure.

++++

[pdf]

}}



@inproceedings{schulz-behnke-11_object-segmentation-deep-convolutional-net,
 title={Object-class segmentation using deep convolutional neural networks},
  author={Schulz, H. and Behnke, S.},
  booktitle={Proceedings of the DAGM Workshop on New Challenges in Neural Computation},
  pages={58--61},
  year={2011},
  annote = {

Generative-discriminative classifier for segmenting the main object
(foreground) in an image.


Abstract
After successes at image classification, segmentation is the next step
towards image understanding for neural networks. We propose a convolutional
network architecture that outperforms current methods on the
challenging INRIA-Graz02 dataset with regards to accuracy and speed.


[pdf]

}}



@InProceedings{tangY-salakhutdinov-hinton-12icml_deep-lambertian-albedo-learning,
  author =    {Yichuan Tang and Ruslan Salakhutdinov and Geoffrey Hinton},
  title =     {Deep Lambertian Networks},
  booktitle = {Proceedings of the 29th International Conference on Machine Learning (ICML-12)},
  series =    {ICML '12},
  year =      {2012},
  editor =    {John Langford and Joelle Pineau},
  location =  {Edinburgh, Scotland, GB},
  isbn =      {978-1-4503-1285-1},
  month =     {July},
  publisher = {Omnipress},
  address =   {New York, NY, USA},
  pages=      {1623--1630},
  annote = {

Abstract: Visual perception is a challenging problem in part due to
illumination variations. A possible solution is to first estimate an
illumination invariant representation before using it for recognition. The
object albedo and surface normals are examples of such representation. In
this paper, we introduce a multilayer generative model where the latent
variables include the albedo, surface normals, and the light
source. Combining Deep Belief Nets with the Lambertian reflectance
assumption, our model can learn good priors over the albedo from 2D
images. Illumination variations can be explained by changing only the
lighting latent variable in our model. By transferring learned knowledge from
similar objects, albedo and surface normals estimation from a single image is
possible in our model. Experiments demonstrate that our model is able to
generalize as well as improve over standard baselines in one-shot face
recognition.
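
The Lambertian reflectance assumption gives a one-line per-pixel image model,
intensity = albedo * max(0, n . l); a minimal numpy rendering sketch
(illustrative only, not the deep belief net model itself):

import numpy as np

def lambertian_image(albedo, normals, light):
    """Render a Lambertian image: per-pixel intensity = albedo * max(0, n.l).

    albedo : H x W array of reflectances
    normals: H x W x 3 array of unit surface normals
    light  : 3-vector pointing toward the light source
    """
    shading = np.clip(normals @ np.asarray(light, dtype=float), 0.0, None)
    return albedo * shading

# Toy example: a flat surface facing the camera, lit from above.
H, W = 4, 4
albedo = np.full((H, W), 0.8)
normals = np.zeros((H, W, 3))
normals[..., 2] = 1.0
print(lambertian_image(albedo, normals, light=[0.3, 0.3, 0.9]))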

discussion+video: http://icml.cc/discuss/2012/791.html

[pdf]

}}

COMPUTER-AIDED INSTRUCTION




@inproceedings{singh-gulwani-12_automatically-generating-algebra-problems,
 title=  {Automatically Generating Algebra Problems},
  author={Rohit Singh and Sumit Gulwani and Sriram Rajamani},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012},
  annote = {

[generates algebra problems in a graded manner to help students learn.  The
key idea is to have a "similarity" function based on which problems similar
to those the student finds difficult can be generated. ]

---Review 2013 ~prigoyal/cs365/hw2/report.pdf

--Goal--

In this paper, another aspect of education, namely ‘problem generation’, has
been automated: given an algebraic proof problem p involving equalities (of
the form LHS = RHS), how to automatically generate problems ‘similar’ to p,
allowing the user to define the notion of ‘similar’ by fine-tuning. The focus
is on providing personalization to a student trying to learn a particular
concept: the student interacts with the computer to generate problems which
are of interest to him/her and match his/her notion of similarity [2]. The
methodology presented generates unique, fresh and interesting problems as the
user wants.

--Importance of Research problem--

1) For a teacher, giving students fresh problems that use the same set of
    concepts at the same level of difficulty is a tedious task. This paper
    directly targets this problem. [3]

2) Although there are many online sites that provide exercise material, they
    provide only a fixed set of exercises, which does not offer enough
    ‘personalization’ to a learner trying to grasp a concept. By allowing a
    user to interact with the computer and generate problems according to
    his/her idea of similarity, the user gets to solve fresh, interesting
    problems at the same difficulty level and on the same concepts. This is
    one of the central focuses of this paper. [3]

--Previous Work on this problem--

There were two earlier approaches to generating similar problems:

a) Flexibility was provided for instantiating the parameters of a problem
    with random constants. But this was done only for the constants in the
    problem and hence does not generate interesting problems.

b) Certain features of the problem domain were provided as hard-coded options
    and users could choose among these options to generate problems. Examples
    of such features in the quadratic domain include “simple factorable” and
    “difficult factorable”. However, this approach is limited to simple
    algebraic domains like counting, or linear and quadratic equation
    solving. [3]

--Main Contributions--

a) The methodology presented is applicable to large sub-fields of algebra.

b) The user can interactively fine-tune the notion of similarity according to
his/her tastes and needs, and generate problems accordingly.

--Brief Overview of the methodology--

A query language (Fig. 1 and Fig. 2 [3]) has been defined in which the
problems are represented; problems are then generated by the following
methodology.

1) Query Generation:

    Given a problem p, it is associated with an abstract set of problems
    (denoted [[Q]]). This step generates the abstract space [[Q]] via a
    recursively defined (tree-structured) notion of a ‘term’, a ‘QProblem’,
    variable domains and query constraints (like denominator != 0, etc.).
    Some query constraints are kept as defaults; for example, if two numbers
    are divided then gcd = 1 is added as a constraint, among many others.
    Further, a set of operations (unary, binary, etc.) is predefined and
    placed abstractly in tree nodes; by replacing the choice nodes with
    values from their sets, a large space [[Q]] is generated. A sample
    generated problem is shown in Figure 3 [3].

    Figure 3: Example of a query problem

2) Query Execution:

    By executing the query Q, a subset of ‘valid’, ‘correct’ and ‘unique’
    problems is extracted. Some backtracking is also done to reduce the size
    of the search and the time of problem generation.

3) Query Tuning:

   ‘The user can remove some automatically generated constraints, or improve
    the generalization by adding more choices to existing choice nodes or new
    choice nodes to the query’, to generate problems according to his/her
    interest and notion of similarity.

RESULTS: Testing was done on the 5 algebra domains and a number of similar
problems were generated. Some problems were unique and not present in the
textbook. Each generated problem was validated manually to ensure that the
problems are indeed interesting and correct. Query execution required at most
one iteration to generate the valid and correct problems.
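
A toy sketch of the ‘correct problems only’ filter in the query-execution
step: candidate identities are kept only if LHS and RHS agree at random
sample points, in the spirit of the polynomial identity testing mentioned in
the abstract below (the candidate list, sample range and tolerance are
assumptions made for illustration):

import math
import random

def probably_identity(lhs, rhs, n_trials=20, tol=1e-6):
    """Keep a candidate problem only if LHS(x) ~= RHS(x) at random points."""
    for _ in range(n_trials):
        x = random.uniform(0.1, 1.4)      # avoid singularities of the samples
        try:
            if abs(lhs(x) - rhs(x)) > tol:
                return False
        except (ZeroDivisionError, ValueError):
            return False
    return True

# Hypothetical "choice node": try several left-hand sides against RHS = 1 and
# keep only the true identities.
candidates = [
    ("sin^2 + cos^2", lambda x: math.sin(x) ** 2 + math.cos(x) ** 2),
    ("sin^2 - cos^2", lambda x: math.sin(x) ** 2 - math.cos(x) ** 2),
    ("sec^2 - tan^2", lambda x: 1.0 / math.cos(x) ** 2 - math.tan(x) ** 2),
]
rhs = lambda x: 1.0
print([name for name, lhs in candidates if probably_identity(lhs, rhs)])
# ['sin^2 + cos^2', 'sec^2 - tan^2']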

REFERENCES:

1. Computer Aided Learning:
http://en.wikipedia.org/wiki/Computer-supported_collaborative_learning
2. Personalization Learning: http://en.wikipedia.org/wiki/Personalized_learning
3. Gulwani, S., Rajamani, S., Singh, R. 2012: Automatically Generating Algebra Problems,
Association for the Advancement of Artificial Intelligence.
4. Singh, R., and Gulwani, S. 2012a. Learning semantic string transformations from examples. Very 
Large Databases (VLDB) 5.
5. Gulwani, S.; Korthikanti, V. A.; and Tiwari, A. 2011. Synthesizing geometry constructions. In 
Programming Language. Design and Implementation (PLDI), 50–61.

---

abstract:
We propose computer-assisted techniques for helping with pedagogy in
Algebra. In particular, given a proof problem p (of the form
Left-hand-side-term = Righthand- side-term), we show how to automatically
generate problems that are similar to p. We believe that such a tool can be
used by teachers in making examinations where they need to test students on
problems similar to what they taught in class, and by students in generating
practice problems tailored to their specific needs. Our first insight is that
we can generalize p syntactically to a query Q that implicitly represents a
set of problems [[Q]] (which includes p). Our second insight is that we can
explore the space of problems [[Q]] automatically, use classical results from
polynomial identity testing to generate only those problems in [[Q]] that are
correct, and then use pruning techniques to generate only unique and
interesting problems. Our third insight is that with a small amount of manual
tuning on the query Q, the user can interactively guide the computer to
generate problems of interest to her.  We present the technical details of the
above mentioned steps, and also describe a tool where these steps have been
implemented. We also present an empirical evaluation on a wide variety of
problems from various sub-fields of algebra including polynomials,
trigonometry, calculus, determinants etc. Our tool is able to generate a rich
corpus of similar problems from each given problem; while some of these
similar problems were already present in the textbook, several were new!

[pdf]

}}

MULTI-AGENT SYSTEMS




@inproceedings{brooks-iba_11ijc-emergence-and-convergence-of-norms,
  title={Modeling the emergence and convergence of norms},
  author={Brooks, L. and Iba, W. and Sen, S.},
  booktitle={Proceedings of the Twenty-Second international joint conference on Artificial Intelligence-Volume Volume One},
  pages={97--102},
  year={2011},
  abstract = {

In many multi-agent systems, the emergence of norms is the primary factor
that determines overall behavior and utility. Agent simulations can be used
to predict and study the development of these norms. However, a large number
of simulations is usually required to provide an accurate depiction of the
agents’ behavior, and some rare contingencies may still be overlooked
completely. The cost and risk involved with agent simulations can be reduced
by analyzing a system theoretically and producing models of its behavior. We
use such a theoretical approach to examine the dynamics of a population of
agents playing a coordination game to determine all the norms to which the
society can converge, and develop a system of linear recurrence relations
that predict how frequently each of these norms will be reached, as well as
the average convergence time. This analysis produces certain guarantees about
system behavior that cannot be provided by a purely empirical approach, and
can be used to make predictions about the emergence of norms that numerically
match those obtained through large-scale simulations.
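
As a quick illustration of the setting (a hand-rolled simulation, not the
paper's recurrence model), a population repeatedly playing a two-action
coordination game, counting which norm each run converges to:

import random
from collections import Counter

def run_once(n_agents=20, rounds=2000, rng=random):
    """Agents meet in random pairs; on miscoordination one of the two copies
    the other's action.  A norm has emerged when everyone plays the same
    action."""
    actions = [rng.choice("AB") for _ in range(n_agents)]
    for _ in range(rounds):
        i, j = rng.sample(range(n_agents), 2)
        if actions[i] != actions[j]:
            loser = rng.choice((i, j))             # copier chosen at random
            actions[loser] = actions[j if loser == i else i]
        if len(set(actions)) == 1:
            return actions[0]
    return None   # no convergence within the horizon

counts = Counter(run_once() for _ in range(1000))
print(counts)     # roughly even split between the norms 'A' and 'B'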

doi: 10.5591/978-1-57735-516-8/IJCAI11-028

[pdf]

}}



@inproceedings{levy-markovitch-12_machine-learning-from-metaphor,
 title={Teaching Machines to Learn by Metaphors},
  author={Omer Levy and Shaul Markovitch},
  booktitle={Twenty-Sixth AAAI Conference on Artificial Intelligence},
  year={2012},
  annote = {

Abstract
Humans have an uncanny ability to learn new concepts with very few
examples. Cognitive theories have suggested that this is done by utilizing
prior experience of related tasks. We propose to emulate this process in
machines, by transforming new problems into old ones. These transformations
are called metaphors. Obviously, the learner is not given a metaphor, but
must acquire one through a learning process. We show that learning metaphors
yield better results than existing transfer learning methods. Moreover, we
argue that metaphors give a qualitative assessment of task relatedness.


[pdf]

}}