Homework 3 - Paper Review
Update: Discussion groups
In this homework / project preparation, you should read a paper from
an NLP conference and prepare a short PPT presentation. The presentation
should be made within your group, discussed by everyone, and then
improved based on the feedback.
One or two presentations from each group will be presented in class.
To select a paper from the list below, fill out this form, making sure
that no one before you has already selected the paper.
The PDFs of most of these papers are available in the directory
http://web.cse.iitk.ac.in/users/cs671/15_papers/
Due date: PPTs should be uploaded by August 27.
Presentations should be discussed by each group in class on August 28,
then improved and presented to the next class. Discussion groups (not
project groups!) will be formed by August 20 based on your paper selections.
List of papers
WORD EMBEDDING (WORD VECTOR LEARNING)
- santos-15_boosting-NER-w-neural-character-embeddings
- guo-che-14_revisiting-embedding-for-semi-supervised-learning
- liu-jiang-15_learning-semantic-word-embeddings-from-ordinal-knowledge
- chen-lin-15_revisiting-word-embedding_contrasting-meaning
- levy-goldberg-14_linguistic-regularities-in-sparse-explicit-word-reprs
- perozzi-alRfou-14_inducing-language-networks-from-continuous-word-reprs
- iacobacci-pilehvar-15_sensembed-embeddings-for-word-relations
- astudillo-amir-15_learning-word-repr-from-scarce-and-noisy-data
- levy-goldberg-14-NIPS_neural-word-embedding-as-matrix-factorization
- stratos-collins-15-ACL_word-embeddings-from-count-matrices
- gouws-sogaard-15-naacl_task-specific-bilingual-word-embeddings
VECTORS FOR PHRASES / SENTENCES
- kiros-salakhutdinov-14_multimodal-neural-language-models
- kalchbrenner-grefenstette-14_CNN-for-modelling-sentences
- yin-W-schuetze-15-CONLL_multichannel-convolution-for-sentence-vector
- le-P-zuidema-15_compositional-distributional-semanticsw-LSTM
WORD VECTOR POS-TAGGING
- fonseca-rosa-15_evaluating-word-embeddings_POS-tagging-in-portuguese
- yin-W-schuetze-14-ACL_exploration-of-embeddings-for-generalized-phrases
- lin-ammar-15_unsupervised-pos-induction-with-word-embeddings
- santos-zadrozny-14-icml_character-level-repre-for-POS-tagging-portuguese
WORD VECTOR FOR SYNTAX
- qu-L-ferraro-baldwin-15-CONLL_big-data-domain-word-repr_sequence-labelling
- maillard-clark-15-CONLL_learning-adjectives-w-tensor-skip-gram
- iyyer-manjunatha-15-deep-unordered-composition-rivals-syntax
- andreas-klein-14_much-word-embeddings-encode-syntax
- ling-dyer-15_two-simple-adaptations-of-word2v-ec-for-syntax
- bansal-gimpel-14_tailoring-word-vectors-for-dependency-parsing
- socher-bauer-ng-manning-13_parsing-with-vector-grammars
- attardi-15_deepNL-deep-learning-nlp-pipeline
- bisk-hockenmaier-15_probing-unsupervised-grammar-induction
- bride-cruys-asher-15-ACL_lexical-composition-in-distributional-semantics
SEMANTICS
- neelakantan-roth-15_compositional-vectors_knowledge-base-completion
- bing-li-P-15-ACL_abstractive-multi-document-summarization_phrase-merging
- purver-sadrzadeh-15_distributional-semantics-to-pragmatics
- miller-gurevych-15_automatic-disambiguation-of-english-puns
- zhou-xu-15_end-to-end-semantic-role-labeling-w-RNN
- wang-berant-15_building-semantic-parser-overnight
- mitchell-steedman-15_orthogonal-syntax-and-semantics-in-distributional-spaces
- camacho-pilehvar-15-ACL_unified-multilingual-semantics-of-concepts
- tackstrom-ganchev-das-15_efficient-structured-learning-for-semantic-role-labeling
- yang-aloimonos-aksoy-15-ACL_learning-semantics-of-manipulation
- quirk-mooney-15_language-to-code_semantic-parsers-for-if-else--recipes
- guo-wang-15_semantically-smooth-knowledge-graph-embedding
- dong-wei-15_question-answering-over-freebase-multi-column-CNN
- peng-chang-roth-15-CONLL_joint-coreference-and-mention-head-detection
- bogdanova-dos-15-CONLL_detecting-semantically-equivalent-questions
- mihaylov-georgiev-15-CONLL_finding-opinion-manipulation-trolls-in-forums
- kulkarni-alRfou-perozzi-15_statistically-significant-linguistic-change
- alRfou-kulkarni-14_polyglot-NER-massive-multilingual-named-entity-recognition
SENTIMENT
- wang-15_sentiment-aspect-w-restricted-boltzmann-machines
- zhou-chen-15_learning-bilingual-sentiment-word-embeddings
- klinger-cimiano-15-CONLL_instance-selection_for_cross-lingual-sentiment
- labutov-basu-15-ACL_deep-questions-without-deep-understanding
- persing-ng-V-15_modeling-argument-strength-in-student-essays
MORPHOLOGY
- ma-J-hinrichs-15-ACL_accurate-linear-time-chinese-word-segm-w-WVs
- cotterell-muller-15-CONLL_labeled-morphological-segmentation-HMM
CROSS-LANGUAGE MODELS
- deMarneffe-dozat-14_universal-dependencies_cross-linguistic
- ishiwatari-kaji-15-CONLL_accurate-cross-lingual-projection_word-vectors
- johannsen-hovy-15-CONLL_cross-lingual-syntactic-variation_age-gender
- duong-cohn-15-CONLL_cross-lingual-transfer-unsupervised-dependency-parsing
- xiao-guo-15-CONLL_annotation-projection-for-cross-lingual-parsing
HINDI / BANGLA
- chatterji-sarkar-basu-14_dependency-annotation_bangla-treebank
- senapati-garain-14_maximum-entropy-based-honorificity_bengali-anaphora
- bhatt-semwal-15-CONLL_iterative-similarity_cross-domain-text-classification
- lim-T-soon-LK-14_lexicon+-tx_rapid-multilingual-lexicon-for-under-resourced-lgs
- bhatt-narasimhan-misra-09_multi-representational-treebank-for-hindi-urdu
- tammewar-singla-15_leveraging-lexical-information-in-SMT
TRANSLATION
- luong-sutskever-le-15-ACL_rare-word-problem-in-neural-machine-translation
- luong-kayser-15-CONLL_deep-neural-language-models-for-translation
- al-Badrashiny-eskander-14_automatic-transliteration_romanized-arabic
- mikolov-le-Q-13_similarities-in-lg-for-machine-translation
MULTIMODAL (IMAGES + LANGUAGE)
- kiros-salakhutdinov-14_multimodal-neural-language-models
- li-chen-15-ACL_visualizing-and-understanding-neural-models-in-NLP
- shutova-tandon-15_perceptually-grounded-selectional-preferences
- kennington-schlangen-15_compositional_perceptually-grounded-word-learning
- elliott-de-15_describing-images-using-inferred-visual-dependency-representations
- weegar-astrom-15-CONLL_linking-entities-across-images-and-text
List of papers: BibTeX
@InProceedings{yang-aloimonos-aksoy-15-ACL_learning-semantics-of-manipulation, author = {Yang, Yezhou and Aloimonos, Yiannis and Fermuller, Cornelia and Aksoy, Eren Erdal}, title = {Learning the Semantics of Manipulation Action}, booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)}, month = {July}, year = {2015}, address = {Beijing, China}, publisher = {Association for Computational Linguistics}, pages = {676--686}, url = {http://www.aclweb.org/anthology/P15-1066} annote = { Abstract In this paper we present a formal computational framework for modeling manipulation actions. The introduced formalism leads to semantics of manipulation action and has applications to both observing and understanding human manipulation actions as well as executing them with a robotic mechanism (e.g. a humanoid robot). It is based on a Combinatory Categorial Grammar. The goal of the introduced framework is to: (1) represent manipulation actions with both syntax and semantic parts, where the semantic part employs -calculus; (2) enable a probabilistic semantic parsing schema to learn the lambda-calculus representation of manipulation action from an annotated action corpus of videos; (3) use (1) and (2) to develop a system that visually observes manipulation actions and understands their meaning while it can reason beyond observations using propositional logic and axiom schemata. The experiments conducted on a public available large manipulation action dataset validate the theoretical framework and our implementation. }} @article{quirk-mooney-15_language-to-code_semantic-parsers-for-if-else--recipes, title={Language to Code: Learning Semantic Parsers for If-This-Then-That Recipes}, author={Quirk, Chris and Mooney, Raymond and Galley, Michel} booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract Using natural language to write programs is a touchstone problem for computational linguistics. We present an approach that learns to map natural-language descriptions of simple “if-then” rules to executable code. By training and testing on a large corpus of naturally-occurring programs (called “recipes”) and their natural language descriptions, we demonstrate the ability to effectively map language to code. We compare a number of semantic parsing approaches on the highly noisy training data collected from ordinary users, and find that loosely synchronous systems perform best. }} @article{zhou-chen-15_learning-bilingual-sentiment-word-embeddings, title={Learning Bilingual Sentiment Word Embeddings for Cross-language Sentiment Classification}, author={Zhou, Huiwei and Chen, Long and Shi, Fulin and Huang, Degen} booktitle={Proceedings of ACL}, year={2015}, annote = { Requires parallel corpus only. --- The sentiment classification performance relies on high-quality sentiment resources. However, these resources are imbalanced in different languages. Cross-language sentiment classification (CLSC) can leverage the rich resources in one language (source language) for sentiment classification in a resource-scarce language (target language). Bilingual embeddings could eliminate the semantic gap between two languages for CLSC, but ignore the sentiment information of text. This paper proposes an approach to learning bilingual sentiment word embeddings (BSWE) for English-Chinese CLSC. The proposed BSWE incorporate sentiment information of text into bilingual embeddings. 
Furthermore, we can learn high-quality BSWE by simply employing labeled corpora and their translations, without relying on large scale parallel corpora. Experiments on the NLP&CC 2013 CLSC dataset show that our approach outperforms the state-of-the-art systems. }}
@article{iyyer-manjunatha-15-deep-unordered-composition-rivals-syntax, title={Deep Unordered Composition Rivals Syntactic Methods for Text Classification}, author={Iyyer, Mohit and Manjunatha, Varun and Boyd-Graber, Jordan and Daum{\'e} III, Hal}, booktitle={Proceedings of ACL}, year={2015}, annote = { ABSTRACT Many existing deep learning models for natural language processing tasks focus on learning the compositionality of their inputs, which requires many expensive computations. We present a simple deep neural network that competes with and, in some cases, outperforms such models on sentiment analysis and factoid question answering tasks while taking only a fraction of the training time. While our model is syntactically-ignorant, we show significant improvements over previous bag-of-words models by deepening our network and applying a novel variant of dropout. Moreover, our model performs better than syntactic models on datasets with high syntactic variance. We show that our model makes similar errors to syntactically-aware models, indicating that for the tasks we consider, nonlinearly transforming the input is more important than tailoring a network to incorporate word order and syntax. }}
@article{elliott-de-15_describing-images-using-inferred-visual-dependency-representations, title={Describing Images using Inferred Visual Dependency Representations}, author={Elliott, Desmond and de Vries, Arjen P}, year={2015}, booktitle={Proc. ACL-15}, annote = { }}
@article{zhou-xu-15_end-to-end-semantic-role-labeling-w-RNN, title={End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks}, author={Zhou, Jie and Xu, Wei}, booktitle={Proceedings of ACL}, year={2015}, annote = { }}
@article{wang-berant-15_building-semantic-parser-overnight, title={Building a Semantic Parser Overnight}, author={Wang, Yushi and Berant, Jonathan and Liang, Percy}, booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract How do we build a semantic parser in a new domain starting with zero training examples? We introduce a new methodology for this setting: First, we use a simple grammar to generate logical forms paired with canonical utterances. The logical forms are meant to cover the desired set of compositional operators, and the canonical utterances are meant to capture the meaning of the logical forms (although clumsily). We then use crowdsourcing to paraphrase these canonical utterances into natural utterances. The resulting data is used to train the semantic parser. We further study the role of compositionality in the resulting paraphrases. Finally, we test our methodology on seven domains and show that we can build an adequate semantic parser in just a few hours. The crowdsourced dataset is available, as is the code. }}
@article{miller-gurevych-15_automatic-disambiguation-of-english-puns, title={Automatic disambiguation of English puns}, author={Miller, Tristan and Gurevych, Iryna}, booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract Traditional approaches to word sense disambiguation (WSD) rest on the assumption that there exists a single, unambiguous communicative intention underlying every word in a document. However, writers sometimes intend for a word to be interpreted as simultaneously carrying multiple distinct meanings.
This deliberate use of lexical ambiguity—i.e., punning— is a particularly common source of humour. In this paper we describe how traditional, language-agnostic WSD approaches can be adapted to “disambiguate” puns, or rather to identify their double meanings. We evaluate several such approaches on a manually sense-annotated collection of English puns and observe performance exceeding that of some knowledge-based and supervised baselines. annotated pun database probably not available yet: Pending clearance of the distribution rights, we will make some or all of our annotated data set available on our website at https://www.ukp.tu-darmstadt.de/data/. }} @article{neelakantan-roth-15_compositional-vectors_knowledge-base-completion, title={Compositional Vector Space Models for Knowledge Base Completion}, author={Neelakantan, Arvind and Roth, Benjamin and McCallum, Andrew}, booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract Knowledge base (KB) completion adds new facts to a KB by making inferences from existing facts, for example by inferring with high likelihood nationality(X,Y) from bornIn(X,Y). Most previous methods infer simple one-hop relational synonyms like this, or use as evidence a multi-hop relational path treated as an atomic feature, like bornIn(X,Z)!containedIn(Z,Y). This paper presents an approach that reasons about conjunctions of multi-hop relations non-atomically, composing the implications of a path using a recurrent neural network (RNN) that takes as inputs vector embeddings of the binary relation in the path. Not only does this allow us to generalize to paths unseen at training time, but also, with a single high-capacity RNN, to predict new relation types not seen when the compositional model was trained (zero-shot learning). We assemble a new dataset of over 52M relational triples, and show that our method improves over a traditional classifier by 11%, and a method leveraging pre-trained embeddings by 7%. {arXiv preprint arXiv:1504.06662}, }} @article{labutov-basu-15-ACL_deep-questions-without-deep-understanding, title={Deep Questions without Deep Understanding}, author={Labutov, Igor and Basu, Sumit and Vanderwende, Lucy} booktitle={Proceedings of ACL}, year={2015}, annote = { develops an approach for generating deep (i.e, high-level) comprehension questions from novel text that bypasses the myriad challenges of creating a full semantic representation. 1. represents the original text in a low-dimensional ontology, 2. crowd-sources candidate question templates aligned with that space 3. ranks potentially relevant templates for a novel region of text If ontological labels are not available, we infer them from the text. We demonstrate the effectiveness of this method on a corpus of articles from Wikipedia alongside human judgments, and find that we can generate relevant deep questions with a precision of over 85% while maintaining a recall of 70%. --- uses section titles (e.g. wikipedia) crowdsourcing: after reading a section, they are asked to "write a follow-up test question on ..." --> creates ~ 1K templates for "deeper", non-factoid questions. ==some of the questions generated== What accomplishments characterized the later ca-reer of Colin Cowdrey? Acting Career How did Corbin Bern-stein’s television career evolve over time? What are some unique ge-ographic features of Puerto Rico Highway 10? How much significance do people of DeMartha Cath-olic High School place on athletics? How does the geography of Puerto Rico Highway 10 impact its resources? 
What type of reaction did Thornton Dial receive? What were the most im-portant events in the later career of Corbin Berstein? What geographic factors influence the preferred transport methods in Wey-mouth Railway Station? How has Freddy Mitch-ell’s legacy shaped current events? }} @article{astudillo-amir-15_learning-word-repr-from-scarce-and-noisy-data, title={Learning Word Representations from Scarce and Noisy Data with Embedding Sub-spaces}, author={Astudillo, Ramon F and Amir, Silvio and Lin, Wang and Silva, M{\'a}rio and Trancoso, Isabel} booktitle={Proceedings of ACL}, year={2015}, annote = { ABSTRACT We investigate a technique to adapt unsupervised word embeddings to specific applications, when only small and noisy labeled datasets are available. Current methods use pre-trained embeddings to initialize model parameters, and then use the labeled data to tailor them for the intended task. However, this approach is prone to overfitting when the training is performed with scarce and noisy data. To overcome this issue, we use the supervised data to find an embedding subspace that fits the task complexity. All the word representations are adapted through a projection into this task-specific subspace, even if they do not occur on the labeled dataset. This approach was recently used in the SemEval 2015 Twitter sentiment analysis challenge, attaining state-of-the-art results. Here we show results improving those of the challenge, as well as additional experiments in a Twitter Part-Of-Speech tagging task. }} @article{persing-ng-V-15_modeling-argument-strength-in-student-essays, title={Modeling Argument Strength in Student Essays}, author={Persing, Isaac and Ng, Vincent} booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract While recent years have seen a surge of interest in automated essay grading, including work on grading essays with respect to particular dimensions such as prompt adherence, coherence, and technical quality, there has been relatively little work on grading the essay dimension of argument strength, which is arguably the most important aspect of argumentative essays. We introduce a new corpus of argumentative student essays annotated with argument strength scores and propose a supervised, feature-rich approach to automatically scoring the essays along this dimension. Our approach significantly outperforms a baseline that relies solely on heuristically applied sentence argument function labels by up to 16.1%. 
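For orientation, here is a toy supervised scoring baseline in scikit-learn (an illustrative sketch with placeholder data, not the paper's feature-rich system, which uses argument-specific features):

# Toy baseline: n-gram features + ridge regression on gold argument-strength scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

essays = ["First sample essay text ...", "Second sample essay text ..."]  # placeholder corpus
scores = [2.5, 3.0]                                                       # gold strength scores
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1), Ridge(alpha=1.0))
model.fit(essays, scores)                       # learn a mapping from n-gram features to scores
print(model.predict(["An unseen essay ..."]))   # predicted argument-strength score

This only illustrates the regression setup; the paper's reported gains come from its much richer feature set.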
}} @article{mitchell-steedman-15_orthogonal-syntax-and-semantics-in-distributional-spaces, title={Orthogonality of Syntax and Semantics within Distributional Spaces}, author={Mitchell, Jeff and Steedman, Mark} booktitle={Proceedings of ACL}, year={2015}, annote = { }} @article{bing-li-P-15-ACL_abstractive-multi-document-summarization_phrase-merging, title={Abstractive Multi-Document Summarization via Phrase Selection and Merging }, author={Bing, Lidong and Li, Piji and Liao, Yi and Lam, Wai and Guo, Weiwei and Passonneau, Rebecca J}, booktitle={Proceedings of ACL}, year={2015}, annote = { arXiv preprint arXiv:1506.01597 }} @article{luong-sutskever-le-15-ACL_rare-word-problem-in-neural-machine-translation, title={Addressing the Rare Word Problem in Neural Machine Translation}, author={Luong, Thang and Sutskever, Ilya and Le, Quoc V and Vinyals, Oriol and Zaremba, Wojciech}, booktitle={Proceedings of ACL}, year={2014}, annote = { 2014 arXiv preprint arXiv:1410.8206 }} @article{ma-J-hinrichs-15-ACL_accurate-linear-time-chinese-word-segm-w-WVs, title={Accurate Linear-Time Chinese Word Segmentation via Embedding Ma2015tching}, author={Ma, Jianqiang and Hinrichs, Erhard} booktitle={Proceedings of ACL}, year={2015}, annote = { }} @article{li-luong-15-ACL_hierarchical-autoencoder-for-paragraphs-and-documents, title={A Hierarchical Neural Autoencoder for Paragraphs and Documents}, author={Li, Jiwei and Luong, Minh-Thang and Jurafsky, Dan}, booktitle={Proceedings of ACL}, year={2015} annote = { Generating multi-sentence (paragraphs/documents) is a challenging problem for recurrent networks models. In this paper, we explore an important step toward this generation task: training an LSTM (Longshort term memory) auto-encoder to preserve and reconstruct multi-sentence paragraphs. }} @article{camacho-pilehvar-15-ACL_unified-multilingual-semantics-of-concepts, title={A unified multilingual semantic representation of concepts}, author={Camacho-Collados, Jos{\'e} and Pilehvar, Mohammad Taher and Navigli, Roberto}, journal={Proceedings of ACL, Beijing, China}, year={2015}, annote = { ABST: most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach in two different evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets. vector space representations are * conventionally co-occurrence based (Salton et al., 1975; Turney and Pantel, 2010; Landauer and Dooley, 2002), * the newer predictive branch (Collobert andWeston, 2008; Mikolov et al., 2013; Baroni et al., 2014), both suffer from a major drawback: they are unable to model individual word senses or concepts, as they conflate different meanings of a word into a single vectorial representation. [hinders WSD) There have been several efforts to adapt and apply distributional approaches to the representation of word senses (Pantel and Lin, 2002; Brody and Lapata, 2009; Reisinger and Mooney, 2010; Huang et al., 2012). 
However, none of these techniques provides representations that are already linked to a standard sense inventory, and consequently such mapping has to be carried out either manually, or with the help of sense-annotated data. Chen et al. (2014) addressed this issue and obtained vectors for individual word senses by leveraging WordNet glosses. NASARI (Camacho-Collados et al., 2015) is another approach that obtains accurate sense-specific representations by combining the complementary knowledge from WordNet and Wikipedia. Graph-based approaches have also been successfully utilized to model individual words (Hughes and Ramage, 2007; Agirre et al., 2009; Yeh et al., 2009), or concepts (Pilehvar et al., 2013; Pilehvar and Navigli, 2014), drawing on the structural properties of semantic networks. The applicability of all these techniques, however, is usually either constrained to a single language (usually English), or to a specific task. }} @inproceedings{levy-goldberg-14-NIPS_neural-word-embedding-as-matrix-factorization, title={Neural word embedding as implicit matrix factorization}, author={Levy, Omer and Goldberg, Yoav}, booktitle={Advances in Neural Information Processing Systems}, pages={2177--2185}, year={2014}, annote = { suggests that what the word2vec model is doing is effectively factorizing a count matrix - the LOW RANK FACTORIZATION gives the word embeddings. shows that the skip-gram model of Mikolov et al, when trained with their negative sampling approach can be understood as a weighted matrix factorization of a word-context matrix with cells weighted by point wise mutual information (PMI), which has long been empirically known to be a useful way of constructing word-context matrices for learning semantic representations of words. }} @article{stratos-collins-15-ACL_word-embeddings-from-count-matrices, title={Model-based Word Embeddings from Decompositions of Count Matrices}, author={Stratos, Karl and Collins, Michael and Hsu, Daniel} booktitle={Proceedings of ACL}, year={2015}, annote = { A successful method for deriving word embeddings is the negative sampling training of the skip-gram model suggested by Mikolov et al. (2013b) and implemented in the popular software WORD2VEC. The form of its training objective was motivated by efficiency considerations, but has subsequently been interpreted by Levy and Goldberg (2014b) as seeking a LOW-RANK FACTORIZATION of a matrix whose entries are word-context co-occurrence counts, scaled and transformed in a certain way. This observation sheds new light on WORD2VEC, yet also raises several new questions about word embeddings based on decomposing count data. What is the right matrix to decompose? Are there rigorous justifications for the choice of matrix and count transformations? We investigate these qs through the decomposition specified by canonical correlation analysis (CCA, Hotelling, 1936), a powerful technique for inducing generic representations whose computation is efficiently and exactly reduced to that of a matrix singular value decomposition (SVD). We build on and strengthen the work of Stratos et al. (2014) which uses CCA for learning the class of HMMs underlying Brown clustering. CCA seeks to maximize the Pearson correlation coefficient between scalar random variables L;R, which is a value in [-11; 1] indicating the degree of linear dependence between L and R. 
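As a concrete illustration of this count-based view (a minimal sketch of my own, not code from either paper): build a PPMI-weighted word-context co-occurrence matrix and take a truncated SVD as the low-rank factorization, using the scaled left singular vectors as embeddings.

import numpy as np

def ppmi_svd_embeddings(corpus, window=2, dim=50):
    # corpus: list of tokenized sentences (lists of strings)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[idx[w], idx[sent[j]]] += 1.0    # word-context co-occurrences
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total        # marginal word probabilities
    p_c = counts.sum(axis=0, keepdims=True) / total        # marginal context probabilities
    pmi = np.log(np.maximum(counts / total, 1e-12) / (p_w * p_c))
    ppmi = np.maximum(pmi, 0.0)                             # positive PMI weighting
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)      # low-rank factorization
    dim = min(dim, len(vocab))
    return U[:, :dim] * S[:dim], vocab                      # rows are word embeddings

Levy and Goldberg's result concerns a shifted version of this PMI matrix (shifted by the log of the number of negative samples); the sketch uses plain PPMI plus SVD for simplicity.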
}} @article{bride-cruys-asher-15-ACL_lexical-composition-in-distributional-semantics, title={A Generalisation of Lexical Functions for Composition in Distributional Semantics}, author={Bride, Antoine and Van de Cruys, Tim and Asher, Nicholas} booktitle={Proceedings of ACL}, year={2015}, annote = { ABSTRACT: It is not straightforward to construct meaning representations beyond the level of individual words – i.e. the combination of words into larger units – using distributional methods. Our contribution is twofold. First of all, we carry out a large-scale evaluation, comparing different composition methods within the distributional framework for the cases of both adjective-noun and noun-noun composition, making use of a newly developed dataset. Secondly, we propose a novel method for composition, which generalises the approach by Baroni and Zamparelli (2010). The performance of our novel method is also evaluated on our new dataset and proves competitive with the best methods. Requires availability of parser --- the additive and multiplicative model (Mitchell and Lapata, 2008) are entirely general, and take as input automatically constructed vectors for adjectives and nouns. The method by Baroni and Zamparelli, on the other hand, requires the acquisition of a particular function for each adjective, represented by a matrix. The second goal of our paper is to generalise the functional approach in order to eliminate the need for an individual function for each adjective. To this goal, we automatically learn a generalised lexical function, based on Baroni and Zamparelli’s approach. This generalised function combines with an adjective vector and a noun vector in a generalised way. The performance of our novel generalised lexical function approach is evaluated on our test sets and proves competitive with the best, extant methods. Our semantic space for French was built using the FRWaC corpus (Baroni et al., 2009) – about 1,6 billion words of web texts – which has been tagged with MElt tagger (Denis et al., 2010) and parsed with MaltParser (Nivre et al., 2006a), trained on a dependency-based version of the French treebank (Candito et al., 2010). Our semantic space for English has been built using the UKWaC corpus (Baroni et al., 2009), which consists of about 2 billion words extracted from the web. The corpus has been part of speech tagged and lemmatized with Stanford Part-Of- Speech Tagger (Toutanova and Manning, 2000; Toutanova et al., 2003), and parsed with Malt- Parser (Nivre et al., 2006b) trained on sections 2-21 of the Wall Street Journal section of the Penn Treebank extended with about 4000 questions from the QuestionBank. [http://maltparser.org/mco/english_parser/engmalt.html] For both corpora, we extracted the lemmas of all nouns, adjectives and (bag of words) context words. We only kept those lemmas that consist of alphabetic characters. 
[filters out dates, numbers and punctuation] }} @inproceedings{purver-sadrzadeh-15_distributional-semantics-to-pragmatics, title={From Distributional Semantics to Distributional Pragmatics?}, author={Purver, Matthew and Sadrzadeh, Mehrnoosh}, booktitle={Interactive Meaning Construction A Workshop at IWCS 2015}, pages={21}, year={2015}, annote = { NO PDF }} ##DEEP LEARNING SYNTAX DISCOVERY @article{kalchbrenner-grefenstette-14_CNN-for-modelling-sentences, title={A convolutional neural network for modelling sentences}, author={Kalchbrenner, Nal and Grefenstette, Edward and Blunsom, Phil}, journal={arXiv preprint arXiv:1404.2188}, year={2014}, annote = { models tree-like hierarchical structures of syntax tested in Hindi by arpit and jayesh. Q. how good is it in detecting phrase boundaries? }} @article{li-chen-15-ACL_visualizing-and-understanding-neural-models-in-NLP, title={Visualizing and Understanding Neural Models in NLP}, author={Li, Jiwei and Chen, Xinlei and Hovy, Eduard and Jurafsky, Dan}, journal={arXiv preprint arXiv:1506.01066}, year={2015}, annote = { }} # MULTIMODAL GROUNDING @inproceedings{kiros-salakhutdinov-14_multimodal-neural-language-models, title={Multimodal neural language models}, author={Kiros, Ryan and Salakhutdinov, Ruslan and Zemel, Rich}, booktitle={Proceedings of the 31st International Conference on Machine Learning (ICML-14)}, pages={595--603}, year={2014}, annote = { Abstract We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees. While we focus on imagetext modelling, our algorithms can be easily applied to other modalities such as audio. Neural Language Models: A neural language model improves on n-gram language models by reducing the curse of dimensionality through the use of distributed word representations. Each word in the vocabulary is represented as a real-valued feature vector such that the cosine of the angles between these vectors is high for semantically similar words. Several models have been proposed based on feedforward networks (Bengio et al., 2003), log-bilinear models (Mnih & Hinton, 2007), skip-gram models (Mikolov et al., 2013) and recurrent neural networks (Mikolov et al., 2010; 2011). Training can be sped up through the use of hierarchical softmax (Morin & Bengio, 2005) or noise contrastive estimation (Mnih & Teh, 2012) Bengio, Yoshua, Ducharme, R´ejean, Vincent, Pascal, and Janvin, Christian. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003. Mikolov, Tomas, Karafi´at, Martin, Burget, Lukas, Cernock`y, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010. Mikolov, Tomas, Deoras, Anoop, Povey, Daniel, Burget, Lukas, and Cernocky, Jan. Strategies for training large scale neural network language models. In ASRU, pp. 196–201. IEEE, 2011. Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. 
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. Mnih, Andriy and Hinton, Geoffrey. Three new graphical models for statistical language modelling. In ICML, pp. 641–648. ACM, 2007. Mnih, Andriy and Teh, Yee Whye. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012. Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In AISTATS, pp. 246–252, 2005. img/kiros-14_multimodal-language_image-description-learning.jpg Figure 1. Left two columns: Sample description retrieval given images. Right two columns: description generation. Each description was initialized to ‘in this picture there is’ or ‘this product contains a’, with 50 subsequent words generated. [I think the top row are input, and bottom is generation; not L/R as in caption] Image Description Generation: A growing body of research has explored how to generate realistic text descriptions given an image. Farhadi et al. (2010) consider learning an intermediate meaning space to project image and sentence features allowing them to retrieve text from images and vice versa. Kulkarni et al. (2011) construct a CRF using unary potentials from objects, attributes and prepositions and high-order potentials from text corpora, using an n-gram model for decoding and templates for constraints. To allow for more descriptive and poetic generation, Mitchell et al. (2012) propose the use of syntactic trees constructed from 700,000 Flickr images and text descriptions. For large scale description generation, Ordonez et al. (2011) showed that non-parametric approaches are effective on a dataset of one million image-text captions. More recently, Socher et al. (2014) show how to map sentence representations from recursive networks into the same space as images. We note that unlike most existing work, our generated text comes directly from language model samples without any additional templates, structure, or constraints. Multimodal Representation Learning: Deep learning methods have been successfully used to learn representations from multiple modalities. Ngiam et al. (2011) proposed using deep autoencoders to learn features from audio and video, while Srivastava & Salakhutdinov (2012) introduced the multimodal deep Boltzmann machine as a joint model of images and text. Unlike Srivastava & Salakhutdinov (2012), our proposed models are conditional and go beyond bag-of-word features. More recently, Socher et al. (2013) and Frome et al. (2013) propose methods for mapping images into a text representation space learned from a language model that incorporates global context (Huang et al., 2012) or a skip-gram model (Mikolov et al., 2013), respectively. This allowed Socher et al. (2013); Frome et al. (2013) to perform zero-shot learning, generalizing to classes the model has never seen before. Similar to our work, the authors combine convolutional networks with a language model but our work instead focuses on text generation and retrieval as opposed to object classification. Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeffrey, and MarcAurelio Ranzato, Tomas Mikolov. Devise: A deep visual-semantic embedding model. 2013. Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In ACL, pp. 873–882, 2012. Socher, Richard, Ganjoo, Milind, Manning, Christopher D, and Ng, Andrew. 
Zero-shot learning through cross-modal transfer. In NIPS, pp. 935–943, 2013. The remainder of the paper is structured as follows. We first review the log-bilinear model of Mnih & Hinton (2007) as it forms the foundation for our work. We then introduce our two proposed models as well as how to perform joint image-text feature learning. Finally, we describe our experiments and results. References Bengio, Yoshua, Ducharme, R´ejean, Vincent, Pascal, and Janvin, Christian. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003. Berg, Tamara L, Berg, Alexander C, and Shih, Jonathan. Automatic attribute discovery and characterization from noisy web data. In ECCV, pp. 663–676. Springer, 2010. Coates, Adam and Ng, Andrew. The importance of encoding versus training with sparse coding and vector quantization. In ICML, pp. 921–928, 2011. Multimodal Neural Language Models Collobert, Ronan and Weston, Jason. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008. Donahue, Jeff; Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy; Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition. Farhadi, Ali, Hejrati, Mohsen, Sadeghi, Mohammad Amin, Young, Peter, Rashtchian, Cyrus, Hockenmaier, Julia, and Forsyth, David. Every picture tells a story: Generating sentences from images. In ECCV, pp. 15–29. Springer, 2010. Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeffrey, and MarcAurelio Ranzato, Tomas Mikolov. Devise: A deep visual-semantic embedding model. 2013. Grangier, David, Monay, Florent, and Bengio, Samy. A discriminative approach for the retrieval of images from text queries. In Machine Learning: ECML 2006, pp. 162–173. Springer, 2006. Grubinger, Michael, Clough, Paul, M¨uller, Henning, and Deselaers, Thomas. The iapr tc-12 benchmark: A new evaluation resource for visual information systems. In InternationalWorkshop OntoImage, pp. 13–23, 2006. Gupta, Ankush and Mannem, Prashanth. From image annotation to image description. In NIP, pp. 196–204. Springer, 2012. Gupta, Ankush, Verma, Yashaswi, and Jawahar, CV. Choosing linguistics over vision to describe images. In AAAI, 2012. Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: data, models and evaluation metrics. JAIR, 47(1):853–899, 2013. Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In ACL, pp. 873–882, 2012. Kiros, Ryan and Szepesv´ari, Csaba. Deep representations and codes for image auto-annotation. In NIPS, pp. 917–925, 2012. Krizhevsky, Alex, Hinton, Geoffrey E, et al. Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, pp. 621–628, 2010. Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoff. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012. Kulkarni, Girish, Premraj, Visruth, Dhar, Sagnik, Li, Siming, Choi, Yejin, Berg, Alexander C, and Berg, Tamara L. Baby talk: Understanding and generating simple image descriptions. In CVPR, pp. 1601–1608. IEEE, 2011. Memisevic, Roland and Hinton, Geoffrey. Unsupervised learning of image transformations. In CVPR, pp. 1–8. IEEE, 2007. Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch, Alyssa, Goyal, Amit, Berg, Alex, Yamaguchi, Kota, Berg, Tamara, Stratos, Karl, and Daum´e III, Hal. 
Midge: Generating image descriptions from computer vision detections. In EACL, pp. 747–756, 2012. Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In AISTATS, pp. 246–252, 2005. Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, and Ng, Andrew. Multimodal deep learning. In ICML, pp. 689–696, 2011. Oliva, Aude and Torralba, Antonio. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001. Ordonez, Vicente, Kulkarni, Girish, and Berg, Tamara L. Im2text: Describing images using 1 million captioned photographs. In NIPS, pp. 1143–1151, 2011. Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei- Jing. Bleu: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. ACL, 2002. Socher, Richard, Ganjoo, Milind, Manning, Christopher D, and Ng, Andrew. Zero-shot learning through cross-modal transfer. In NIPS, pp. 935–943, 2013. Socher, Richard, Le, Quoc V, Manning, Christopher D, and Ng, Andrew Y. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014. Srivastava, Nitish and Salakhutdinov, Ruslan. Multimodal learning with deep boltzmann machines. In NIPS, pp. 2231–2239, 2012. Swersky, Kevin, Snoek, Jasper, and Adams, Ryan. Multi-task bayesian optimization. In NIPS, 2013. Turian, Joseph, Ratinov, Lev, and Bengio, Yoshua. Word representations: a simple and general method for semi-supervised learning. In ACL, pp. 384–394. Association for Computational Linguistics, 2010. Wang, Tao,Wu, David J, Coates, Adam, and Ng, Andrew Y. End-to-end text recognition with convolutional neural networks. In ICPR, pp. 3304–3308. IEEE, 2012. }} @incollection{senapati-garain-14_maximum-entropy-based-honorificity_bengali-anaphora, title={A Maximum Entropy Based Honorificity Identification for Bengali Pronominal Anaphora Resolution}, author={Senapati, Apurbalal and Garain, Utpal}, booktitle={Computational Linguistics and Intelligent Text Processing}, pages={319--329}, year={2014}, publisher={Springer}, annote = { }} @article{tackstrom-ganchev-das-15_efficient-structured-learning-for-semantic-role-labeling, title={Efficient Inference and Structured Learning for Semantic Role Labeling}, author={T{\"a}ckstr{\"o}m, Oscar and Ganchev, Kuzman and Das, Dipanjan}, journal={Transactions of the Association for Computational Linguistics}, volume={3}, pages={29--41}, year={2015}, annote = { ABSTRACT We present a dynamic programming algorithm for efficient constrained inference in semantic role labeling. The algorithm tractably captures a majority of the structural constraints examined by prior work in this area, which has resorted to either approximate methods or off-the-shelf integer linear programming solvers. In addition, it allows training a globally-normalized log-linear model with respect to constrained conditional likelihood. We show that the dynamic program is several times faster than an off-the-shelf integer linear programming solver, while reaching the same solution. Furthermore, we show that our structured model results in significant improvements over its local counterpart, achieving state-of-the-art results on both PropBank- and FrameNet-annotated corpora. 
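To give a flavour of exact decoding by dynamic programming, here is a generic first-order Viterbi sketch of my own (not the paper's constrained SRL algorithm, which additionally enforces role-specific structural constraints):

import numpy as np

def viterbi(emission_scores, transition_scores):
    # emission_scores: (n_tokens, n_labels); transition_scores[prev, cur]: (n_labels, n_labels)
    n, k = emission_scores.shape
    best = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    best[0] = emission_scores[0]
    for t in range(1, n):
        scores = best[t - 1][:, None] + transition_scores + emission_scores[t]
        back[t] = scores.argmax(axis=0)        # best previous label for each current label
        best[t] = scores.max(axis=0)
    path = [int(best[-1].argmax())]
    for t in range(n - 1, 0, -1):              # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]                          # highest-scoring label sequence

The paper's contribution is to keep this kind of exact inference while handling the global structural constraints that previously required integer linear programming.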
}} @article{santos-15_boosting-NER-w-neural-character-embeddings, title={Boosting Named Entity Recognition with Neural Character Embeddings}, author={Santos, Cicero Nogueira dos and Guimar{\~a}es, Victor}, journal={arXiv preprint arXiv:1505.05008}, year={2015}, annote = { uses charWNN ABSTRACT Most state-of-the-art named entity recognition (NER) systems rely on handcrafted features and on the output of other NLP tasks such as part-of-speech (POS) tagging and text chunking. In this work we propose a language-independent NER system that uses automatically learned features only. Our approach is based on the CharWNN deep neural network, which uses word-level and character-level representations (embeddings) to perform sequential classification. We perform an extensive number of experiments using two annotated corpora in two different languages: HAREM I corpus, which contains texts in Portuguese; and the SPA CoNLL- 2002 corpus, which contains texts in Spanish. Our experimental results shade light on the contribution of neural character embeddings for NER. Moreover, we demonstrate that the same neural network which has been successfully applied to POS tagging can also achieve state-of-the-art results for language-independet NER, using the same hyperparameters, and without any handcrafted features. For the HAREM I corpus, CharWNN outperforms the state-of-the-art system by 7.9 points in the F1-score for the total scenario (ten NE classes), and by 7.2 points in the F1 for the selective scenario (five NE classes). Given a sentence, the network gives for each word a score for each class (tag) tau IN T. As depicted in Figure 1, in order to score a word, the network takes as input a fixed-sized window of words centralized in the target word. The input is passed through a sequence of layers where features with increasing levels of complexity are extracted. The output for the whole sentence is then processed using the Viterbi algorithm (Viterbi, 1967) to perform structured prediction. For a detailed description of the CharWNN neural network we refer the reader to (dos Santos and Zadrozny, 2014). 2.1 Word- and Character-level Embeddings As illustrated in Figure 1, the first layer of the network transforms words into real-valued feature vectors (embeddings). These embeddings are meant to capture morphological, syntactic and semantic information about the words. We use a fixed-sized word vocabulary Vwrd, and we consider that words are composed of characters from a fixed-sized character vocabulary V chr. Given a sentence consisting of N words (w1;w2; :::;wN), every word wn is converted into a vector un = [rwrd; rwch], which is composed of two subvectors: the word-level embedding rwrd 2 Rdwrd and the character-level embedding rwch 2 Rclu of wn. While word-level embeddings capture syntactic and semantic information, character-level embeddings capture morphological and shape information. Word-level embeddings are encoded by column vectors in an embedding matrix Wwrd 2 RdwrdjV wrdj, and retrieving the embedding of a particular word consists in a simple matrix-vector multiplication. The matrix Wwrd is a parameter to be learned, and the size of the word-level embedding dwrd is a hyperparameter to be set by the user. Given a word w composed of M characters (c1; c2; :::; cM), we first transform each character cm into a character embedding rchr_m . Character embeddings are encoded by column vectors in the embedding matrix [Wchr IN Rdchr x |Vchr|. 
Given a character c, its embedding rchr is obtained by the matrix-vector product: rchr = Wchrvc, where vc is a vector of size |V_chr| which has value 1 at index c and zero in all other positions. The input for the convolutional layer is the sequence of character embeddings (rchr1 ; rchr2 ; :::; rchrM). The convolutional layer applies a matrix-vector operation to each window of size kchr of successive windows in the sequence (rchr1 ; rchr2 ; :::; rchrM). Let us define the vector zm IN Rdchr kchr as the concatenation of the character embedding m, its (kchr - 1)/2 left neighbors, and its (kchr - 1)/2 right neighbors: }} @article{le-P-zuidema-15_compositional-distributional-semanticsw-LSTM, title={Compositional Distributional Semantics with Long Short Term Memory}, author={Le, Phong and Zuidema, Willem}, journal={arXiv preprint arXiv:1503.02510}, year={2015}, annote = { Abstract We are proposing an extension of the recursive neural network that makes use of a variant of the long short-term memory architecture. The extension allows information low in parse trees to be stored in a memory register (the ‘memory cell’) and used much later higher up in the parse tree. This provides a solution to the vanishing gradient problem and allows the network to capture long range dependencies. Experimental results show that our composition outperformed the traditional neural-network composition on the Stanford Sentiment Treebank. }} @article{shutova-tandon-15_perceptually-grounded-selectional-preferences, title={Perceptually grounded selectional preferences}, author={Shutova, Ekaterina and Tandon, Niket and de Melo, Gerard} booktitle={Proceedings of ACL}, year={2015}, annote = { Selectional preference (SP) : "fly" prefers to select "bird" rather than "dog" "winds" prefers "high" rather than "large" to compute: Take all verb phrases (from a large parse of a corpus) with head "fly" -> find frequency of "bird" as main argument vs that of "dog" ABSTRACT Selectional preference (SPs) are widely used in NLP as a rich source of semantic information. While SPs have been traditionally induced from textual data, human lexical acquisition is known to rely on both linguistic and perceptual experience. We present the first SP learning method that simultaneously draws knowledge from text, images and videos, using image and video descriptions to obtain visual features. Our results show that it outperforms linguistic and visual models in isolation, as well as the existing SP induction approaches. We use the BNC as an approximation of linguistic knowledge and a large collection of tagged images and videos from Flickr (www.flickr.com) as an approximation of perceptual knowledge. The human-annotated labels that accompany media on Flickr enable us to acquire predicate-argument cooccurrence information. predicate similarity: Pad´o et al. (2007) and Erk (2007) used similarity metrics to approximate selectional preference classes. Their underlying hypothesis is that a predicate-argument combination (p, a) is felicitous if the predicate p is frequently observed in the data with the arguments a0 similar to a. The systems compute similarities between distributional representations of arguments in a vector space. 
(requires parser or chunker) }} @inproceedings{andreas-klein-14_much-word-embeddings-encode-syntax, title={How much do word embeddings encode about syntax}, author={Andreas, Jacob and Klein, Dan}, booktitle={Proceedings of ACL-14}, year={2014}, annote = { Abstract Do continuous word embeddings encode any useful information for constituency parsing? We isolate three ways in which word embeddings might augment a state-of-the-art statistical parser: * by connecting out-of-vocabulary words to known ones, * by encouraging common behavior among related in-vocabulary words, and * by directly providing features for the lexicon. We test each of these hypotheses with a targeted change to a state-of-the-art baseline. Despite small gains on extremely small supervised training sets, we find that extra information from embeddings appears to make little or no difference to a parser with adequate training data. Our results support an overall hypothesis that word embeddings import syntactic information that is ultimately redundant with distinctions learned from treebanks in other ways. --- Semi-supervised and unsupervised models for a variety of core NLP tasks, including named-entity recognition (Freitag, 2004), part-of-speech tagging (Sch¨utze, 1995), and chunking (Turian et al., 2010) have been shown to benefit from the inclusion of word embeddings as features. In the other direction, access to a syntactic parse has been shown to be useful for constructing word embeddings for phrases compositionally (Hermann and Blunsom, 2013; Andreas and Ghahramani, 2013). Dependency parsers have seen gains from distributional statistics in the form of discrete word clusters (Koo et al., 2008), and recent work (Bansal et al., 2014) suggests that similar gains can be derived from embeddings like the ones used in this paper. [bansal-gimpel-14_tailoring-word-vectors-for-dependency-parsing] It seems clear that word embeddings exhibit some syntactic structure. Consider Figure 1, which shows embeddings for a variety of English determiners, projected onto their first two principal components. img/andreas-14_wv-clusters-for-determiners We can see that the quantifiers EACH and EVERY cluster together, as do FEW and MOST. These are precisely the kinds of distinctions between determiners that state-splitting in the Berkeley parser has shown to be useful (Petrov and Klein, 2007), and existing work (Mikolov et al., 2013b) has observed that such regular embedding structure extends to many other parts of speech. But we don’t know how prevalent or important such “syntactic axes” are in practice. Thus we have two questions: Are such groupings (learned on large data sets but from less syntactically rich models) better than the ones the parser finds on its own? How much data is needed to learn them without word embeddings? results : for two kinds of embeddings (Col&West and CBOW). For both choices of embedding, the pooling and OOV models provide small gains with very little training data, but no gains on the full training set. our results are consistent with two claims: (1) unsupervised word embeddings do contain some syntactically useful information, but (2) this information is redundant with what the model is able to determine for itself from only a small amount of labeled training data. 
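To eyeball the kind of "syntactic axes" described above, a short sketch (assumes gensim and scikit-learn are installed; the embedding file path and the word list are placeholders, not taken from the paper):

from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

# Load any pretrained word embeddings (path is a placeholder).
vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
determiners = ["each", "every", "few", "most", "the", "a", "this", "that", "some", "all"]
words = [w for w in determiners if w in vectors]
coords = PCA(n_components=2).fit_transform([vectors[w] for w in words])
for w, (x, y) in zip(words, coords):
    print("%-6s %8.3f %8.3f" % (w, x, y))   # check whether quantifiers like EACH/EVERY land close together

Whether such clusters actually help a parser that already has adequate treebank data is exactly the question the paper answers in the negative.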
}} @inproceedings{guo-che-14_revisiting-embedding-for-semi-supervised-learning, title={Revisiting embedding features for simple semi-supervised learning}, author={Guo, Jiang and Che, Wanxiang and Wang, Haifeng and Liu, Ting}, booktitle={Proceedings of EMNLP}, pages={110--120}, year={2014}, annote = { }} @article{ling-dyer-15_two-simple-adaptations-of-word2v-ec-for-syntax, title={Two/Too Simple Adaptations of word2vec for Syntax Problems}, author={Ling, Wang and Chris Dyer and Alan Black and Trancoso, Isabel} journal={NAACL 2015}, year = {2015}, annote = { Proposes "Structured Word2Vec" - more sensitive to word order within the context in both CBOW and skip-Ngram methods, the order of the context words does not influence the prediction output. As such, while these methods may find similar representations for semantically similar words, they are less likely to representations based on the syntactic properties of the words. Structured Word2Vec Abstract We present two simple modifications to Word2Vec... The main issue with the original models is the fact that they are insensitive to word order. While order independence is useful for inducing semantic representations, this leads to suboptimal results when they are used to solve syntax-based problems. We show improvements in part-ofspeech tagging and dependency parsing using our proposed models. }} @inproceedings{bansal-gimpel-14_tailoring-word-vectors-for-dependency-parsing, title={Tailoring continuous word representations for dependency parsing}, author={Bansal, Mohit and Gimpel, Kevin and Livescu, Karen}, booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics}, year={2014}, annote = { Koo et al. (2008) found improvement on in-domain dependency parsing using features based on discrete Brown clusters. In this paper, we experiment with parsing features derived from continuous representations. We find that simple attempts based on discretization of individual word vector dimensions do not improve parsing. We see gains only after first performing a hierarchical clustering of the continuous word vectors and then using features based on the hierarchy. We compare several types of continuous representations, including those made available by other researchers (Turian et al., 2010; Collobert et al., 2011; Huang et al., 2012), and embeddings we have trained using the approach of Mikolov et al. (2013a), which is orders of magnitude faster than the others. The representations exhibit different characteristics, which we demonstrate using both intrinsic metrics and extrinsic parsing evaluation. We report significant improvements over our baseline on both the Penn Treebank (PTB; Marcus et al., 1993) and the English Web treebank (Petrov and McDonald, 2012). While all embeddings yield some parsing improvements, we find larger gains by tailoring them to capture similarity in terms of context within syntactic parses. To this end, we use two simple modifications to the models of Mikolov et al. (2013a): a smaller context window, and conditioning on syntactic context (dependency links and labels). Interestingly, the Brown clusters of Koo et al. (2008) prove to be difficult to beat, but we find that our syntactic tailoring can lead to embeddings that match the parsing performance of Brown (on all test sets) in a fraction of the training time. Finally, a simple parser ensemble on all the representations achieves the best results, suggesting their complementarity for dependency parsing. 
2 ContinuousWord Representations There are many ways to train continuous representations; in this paper, we are primarily interested in neural language models (Bengio et al., 2003), which use neural networks and local context to learn word vectors. Several researchers have made their trained representations publicly available, which we use directly in our experiments. In particular, we use the SENNA embeddings of Collobert et al. (2011); the scaled TURIAN embeddings (C&W) of Turian et al. (2010); and the HUANG global-context, single-prototype embeddings of Huang et al. (2012). We also use the BROWN clusters trained by Koo et al. (2008). Details are given in Table 1. 2.1.1 Smaller ContextWindows We find that context window size w affects the embeddings substantially: with large w, words group with others that are topically-related; with small w, grouped words tend to share the same POS tag. We discuss this further in the intrinsic evaluation presented in x2.2. FN. We created a special vector for unknown words by averaging the vectors for the 50K least frequent words; we did not use this vector for the SKIPDEP (x2.1.2) setting because it performs slightly better without it. 2.1.2 Syntactic Context We expect embeddings to help dependency parsing the most when words that have similar parents and children are close in the embedding space. To target this type of similarity, we train the SKIP model on dependency context instead of the linear context in raw text. When ordinarily training SKIP embeddings, words v0 are drawn from the neighborhood of a target word v, and the sum of logprobabilities of each v0 given v is maximized. We propose to instead choose v0 from the set containing the grandparent, parent, and children words of v in an automatic dependency parse. A simple way to implement this idea is to train the original SKIP model on a corpus of dependency links and labels. For this, we parse the BLLIP corpus (minus PTB) using our baseline dependency parser, then build a corpus in which each line contains a single child word c, its parent word p, its grandparent g, and the dependency label l of thelink... Evaluation: Two measures Word similarity (SIM): One widely-used evaluation compares distances in the continuous space to human judgments of word similarity using the 353-pair dataset of Finkelstein et al. (2002). We compute cosine similarity between the two vectors in each word pair, then order the word pairs by similarity and compute Spearman’s rank correlation coefficient () with the gold similarities. Embeddings with high capture similarity in terms of paraphrase and topical relationships. Clustering-based tagging accuracy (M-1): Intuitively, we expect embeddings to help parsing the most if they can tell us when two words are similar syntactically. To this end, we use a metric based on unsupervised evaluation of POS taggers. We perform clustering and map each cluster to one POS tag so as to maximize tagging accuracy, where multiple clusters can map to the same tag. For clustering, we use k-means with k = 1000 and initialize by placing centroids on the 1000 most-frequent words. Table 2 shows these metrics for representations used in this paper. The BROWN clusters have the highest M-1, indicating high cluster purity in terms of POS tags. The HUANG embeddings have the highest SIM score but low M-1, presumably because they were trained with global context, making them more tuned to capture topical similarity. 
We compare several values for the window size (w) used when training the SKIP embeddings, finding that small w leads to higher M-1 and lower SIM. Table 3 shows examples of clusters obtained by clustering SKIP embeddings of w = 1 versus w = 10, and we see that the former correspond closely to POS tags, while the latter are much more topically-coherent and contain mixed POS tags. For parsing experiments, we choose w = 2 for CBOW and w = 1 for SKIP. Finally, our SKIPDEP embeddings, trained with syntactic context and w = 1 (§2.1.2), achieve the highest M-1 of all continuous representations. In §4, we will relate these intrinsic metrics to extrinsic parsing performance.
Table 3: Example clusters for SKIP embeddings with window size w = 1 (syntactic) and w = 10 (topical).
w = 1: [Mr., Mrs., Ms., Prof., ...], [Jeffrey, Dan, Robert, Peter, ...], [Johnson, Collins, Schmidt, Freedman, ...], [Portugal, Iran, Cuba, Ecuador, ...], [CST, 4:30, 9-10:30, CDT, ...], [his, your, her, its, ...], [truly, wildly, politically, financially, ...]
w = 10: [takeoff, altitude, airport, carry-on, airplane, flown, landings, ...], [health-insurance, clinic, physician, doctor, medical, health-care, ...], [financing, equity, investors, firms, stock, fund, market, ...]
}} @article{lin-ammar-15_unsupervised-pos-induction-with-word-embeddings, title={Unsupervised POS Induction with Word Embeddings}, author={Lin, Chu-Cheng and Ammar, Waleed and Dyer, Chris and Levin, Lori}, journal={arXiv preprint arXiv:1503.06760}, year={2015}, annote = { Abstract Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on “downstream” POS induction results. 1 Introduction we explore the effect of replacing words with their vector space embeddings in two POS induction models: the classic first-order HMM (Kupiec, 1992) and the newly introduced conditional random field autoencoder (Ammar et al., 2014). In each model, instead of using a conditional multinomial distribution to generate a word token w_i ∈ V given a POS tag t_i ∈ T, we use a conditional Gaussian distribution and generate a d-dimensional word embedding v_{w_i} ∈ R^d given t_i. Structured skip-gram embeddings (Ling et al., 2015) extend the standard skip-gram embeddings (Mikolov et al., 2013) by taking into account the relative positions of words in a given context. We use the word2vec tool and Ling et al. (2015)’s modified version to generate both plain and structured skip-gram embeddings in nine languages. [Ling et al.2015] Wang Ling, Chris Dyer, Alan Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.
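(A minimal sketch of the Gaussian-emission idea described above, not the authors' implementation; the array shapes and names are assumptions. The resulting (N, T) matrix of log-densities would replace the multinomial emission scores inside the usual HMM forward-backward / EM recursions.)

  import numpy as np
  from scipy.stats import multivariate_normal

  def gaussian_emission_logprobs(token_embeddings, means, covs):
      # token_embeddings: (N, d) embeddings of the observed word tokens.
      # means: (T, d) per-tag means; covs: (T, d, d) per-tag covariances.
      # Returns log p(v_{w_i} | t) for every token i and tag t.
      N, T = token_embeddings.shape[0], means.shape[0]
      logp = np.empty((N, T))
      for t in range(T):
          logp[:, t] = multivariate_normal.logpdf(token_embeddings, mean=means[t], cov=covs[t])
      return logp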
4.1 Choice of POS Induction Models Here, we compare the following models for POS induction:
* Baseline: HMM with multinomial emissions (Kupiec, 1992),
* Baseline: HMM with log-linear emissions (Berg-Kirkpatrick et al., 2010),
* Baseline: CRF autoencoder with multinomial reconstructions (Ammar et al., 2014),
* Proposed: HMM with Gaussian emissions, and
* Proposed: CRF autoencoder with Gaussian reconstructions.
DATA. To train the POS induction models, we used the plain text from the training sections of the ** CoNLL-X shared task (Buchholz and Marsi, 2006) (for Danish and Turkish), the ** CoNLL 2007 shared task (Nivre et al., 2007) (for Arabic, Basque, Greek, Hungarian and Italian), and the ** Ukwabelana corpus (Spiegler et al., 2010) (for Zulu). For evaluation, we obtain the corresponding gold-standard POS tags by deterministically mapping the language-specific POS tags in the aforementioned corpora to the corresponding universal POS tag set (Petrov et al., 2012). This is the same setup we used in (Ammar et al., 2014). [Petrov et al.2012] Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proc. of LREC, May. }} @inproceedings{deMarneffe-dozat-14_universal-dependencies_cross-linguistic, title={Universal Stanford Dependencies: A cross-linguistic typology}, author={De Marneffe, Marie-Catherine and Dozat, Timothy and Silveira, Natalia and Haverinen, Katri and Ginter, Filip and Nivre, Joakim and Manning, Christopher D}, booktitle={Proceedings of LREC}, pages={4585--4592}, year={2014}, annote = { Revisiting the now de facto standard Stanford dependency representation, we propose an improved taxonomy to capture grammatical relations across languages, including morphologically rich ones. We suggest a two-layered taxonomy: a set of broadly attested universal grammatical relations, to which language-specific relations can be added. We emphasize the lexicalist stance of the Stanford Dependencies, which leads to a particular, partially new treatment of compounding, prepositions, and morphology. We show how existing dependency schemes for several languages map onto the universal taxonomy proposed here and close with consideration of practical implications of dependency representation choices for NLP applications, in particular parsing. We consider how to treat grammatical relations in morphologically rich languages, including achieving an appropriate parallelism between expressing grammatical relations by prepositions versus morphology. We show how existing dependency schemes for other languages which draw from Stanford Dependencies (Chang et al., 2009; Bosco et al., 2013; Haverinen et al., 2013; Seraji et al., 2013; McDonald et al., 2013; Tsarfaty, 2013) can be mapped onto the new taxonomy proposed here. We emphasize the lexicalist stance of both most work in NLP and the syntactic theory on which Stanford Dependencies is based, and hence argue for a particular treatment of compounding and morphology. We also discuss the different forms of dependency representation that should be available for SD to be maximally useful for a wide range of NLP applications, converging on three versions: the basic one, the enhanced one (which adds extra dependencies), and a particular form for parsing. de Marneffe, M.-C. and Manning, C. D. (2008). The Stanford typed dependencies representation. In Workshop on Cross-framework and Cross-domain Parser Evaluation.
}} @article{fonseca-rosa-15_evaluating-word-embeddings_POS-tagging-in-portuguese, title={Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese}, author={Fonseca, Erick R and Rosa, Jo{\~a}o Lu{\'\i}s G and Alu{\'\i}sio, Sandra Maria}, journal={Journal of the Brazilian Computer Society}, volume={21}, number={1}, pages={1--14}, year={2015}, publisher={Springer}, annote = { no PDF, only online: http://link.springer.com/article/10.1186/s13173-014-0020-x/fulltext.html Based on a charWNN-like model: 7. Maia MRdH, Xexéo GB (2011) Part-of-speech tagging of Portuguese using hidden Markov models with character language model emissions. In: Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology, Cuiabá, Brazil. pp 159–163 Our new version of Mac-Morpho is available at http://nilc.icmc.usp.br/macmorpho/, along with previous versions of the corpus. Our tagger is at http://nilc.icmc.sc.usp.br/nlpnet/. }} @inproceedings{gouws-sogaard-15-naacl_task-specific-bilingual-word-embeddings, title={Simple task-specific bilingual word embeddings}, author={Gouws, Stephan and S{\o}gaard, Anders}, booktitle={Proceedings of NAACL-HLT}, pages={1386--1390}, year={2015}, annote = { }} @inproceedings{santos-zadrozny-14-icml_character-level-repre-for-POS-tagging-portuguese, title={Learning character-level representations for part-of-speech tagging}, author={Santos, Cicero D and Zadrozny, Bianca}, booktitle={Proceedings of the 31st International Conference on Machine Learning (ICML-14)}, pages={1818--1826}, year={2014}, annote = { For morphologically rich languages: instead of doing the analysis at the level of words, use the characters in the words directly. That way the system can find similarities between morphologically inflected words. charWNN: generates character-level embeddings for each word, even for words outside the vocabulary. The similarities for characters seem more morphologically biased (see Tables below) ... information about word morphology and shape is crucial for various NLP tasks. Usually, when a task needs morphological or word shape information, the knowledge is included as handcrafted features (Collobert, 2011). One task for which character-level information is very important is part-of-speech (POS) tagging, which consists in labeling each word in a text with a unique POS tag, e.g. noun, verb, pronoun and preposition. In this paper we propose a new deep neural network (DNN) architecture that joins word-level and character-level representations to perform POS tagging. The proposed DNN, which we call CharWNN, uses a convolutional layer that allows effective feature extraction from words of any size. At tagging time, the convolutional layer generates character-level embeddings for each word, even for the ones that are outside the vocabulary. We present experimental results that demonstrate the effectiveness of our approach to extract character-level features relevant to POS tagging. Using CharWNN, we create state-of-the-art POS taggers for two languages: English and Portuguese. Both POS taggers are created from scratch, without including any handcrafted feature. Additionally, we demonstrate that a similar DNN that does not use character-level embeddings only achieves state-of-the-art results when using handcrafted features. Although here we focus exclusively on POS tagging, the proposed architecture can be used, without modifications, for other NLP tasks such as text chunking and named entity recognition. 2.1.
Initial Representation Levels The first layer of the network transforms words into real-valued feature vectors (embeddings) that capture morphological, syntactic and semantic information about the words. We use a fixed-sized word vocabulary V_wrd, and we consider that words are composed of characters from a fixed-sized character vocabulary V_chr. Given a sentence consisting of N words {w_1, w_2, ..., w_N}, every word w_n is converted into a vector u_n = [r_wrd; r_wch], which is composed of two sub-vectors: the word-level embedding r_wrd ∈ R^{d_wrd} and the character-level embedding r_wch ∈ R^{clu} of w_n. While word-level embeddings are meant to capture syntactic and semantic information, character-level embeddings capture morphological and shape information. 2.1.2. CHARACTER-LEVEL EMBEDDINGS Robust methods to extract morphological information from words must take into consideration all characters of the word and select which features are more important for the task at hand. For instance, in the POS tagging task, informative features may appear in the beginning (like the prefix “un” in “unfortunate”), in the middle (like the hyphen in “self-sufficient” and the “h” in “10h30”), or at the end (like the suffix “ly” in “constantly”). Character convolution - first introduced by Waibel et al. (1989). As depicted in Fig. 1, our convolutional approach produces local features around each character of the word and then combines them using a max operation to create a fixed-sized character-level embedding of the word. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. J. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3):328–339, 1989. Like in (Collobert et al., 2011), we use a prediction scheme that takes into account the sentence structure. The method uses a transition score A_{t,u} for jumping from tag t ∈ T to u ∈ T in successive words, and a score A_{0,t} for starting from the t-th tag. Given the sentence w_1, w_2, ..., w_N, the score for tag path t_1, ..., t_N is computed as S = Σ_{n=1..N} (A_{t_{n-1}, t_n} + s(w_n)_{t_n}), where s(w_n)_{t_n} is the network score for assigning tag t_n to the n-th word. The input for the convolutional layer is the sequence of character embeddings. The convolutional layer applies a matrix-vector operation to each window of size k_chr of successive character embeddings in the sequence. 2.3. Network Training Our network is trained by minimizing a negative likelihood over the training set D. In the same way as in (Collobert et al., 2011), we interpret the sentence score (5) as a conditional probability over a path. For this purpose, we exponentiate the score (5) and normalize it with respect to all possible paths. Taking the log, we arrive at the following conditional log-probability... For the Portuguese experiments, we use three sources of unlabeled text: the Portuguese Wikipedia; the CETENFolha corpus; and the CETEMPublico corpus. The Portuguese Wikipedia corpus is preprocessed using the same approach applied to the English Wikipedia corpus. However, we implemented our own Portuguese tokenizer. ---
Table 1. WSJ Corpus for English POS Tagging.
SET       SENT.   TOKENS   OOSV   OOUV
TRAINING  38,219  912,344  0      6317
DEVELOP.  5,527   131,768  4,467  958
TEST      5,462   129,654  3,649  923
We use the Mac-Morpho corpus to perform POS tagging of Portuguese language text. The Mac-Morpho corpus (Aluísio et al., 2003) contains around 1.2 million manually tagged words. Its tagset contains 22 POS tags (41 if punctuation marks are included) and 10 more tags that represent additional semantic aspects.
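(A minimal numpy sketch of the character-level convolution plus max-over-time pooling described above; not the authors' Theano code, and the shapes/names are assumptions.)

  import numpy as np

  def char_level_embedding(char_vecs, W, b, k):
      # char_vecs: (L, d_chr) embeddings of one word's characters, padded so L >= k.
      # W: (clu, k * d_chr) and b: (clu,) are the convolution parameters.
      # Each window of k consecutive character vectors is concatenated and
      # projected; a max over all window positions yields a fixed-size
      # character-level embedding r_wch of the word.
      L, d = char_vecs.shape
      windows = np.stack([char_vecs[i:i + k].reshape(-1) for i in range(L - k + 1)])
      return (windows @ W.T + b).max(axis=0)   # shape: (clu,)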
Our Theano-based implementation of CharWNN takes around 2h30min to complete eight training epochs for the Mac-Morpho corpus. In our experiments, we use 4 threads on an Intel Xeon E5-2643 3.30GHz machine.
Table 10. Most similar words using character-level embeddings learned with WSJ Corpus.
INCONSIDERABLE 83-YEAR-OLD SHEEP-LIKE DOMESTICALLY UNSTEADINESS 0.0055
------------------------------------------------------------------------------
INCONCEIVABLE 43-YEAR-OLD ROCKET-LIKE FINANCIALLY UNEASINESS 0.0085
INDISTINGUISHABLE 63-YEAR-OLD FERN-LIKE ESSENTIALLY UNHAPPINESS 0.0075
INNUMERABLE 73-YEAR-OLD SLIVER-LIKE GENERALLY UNPLEASANTNESS 0.0015
INCOMPATIBLE 49-YEAR-OLD BUSINESS-LIKE IRONICALLY BUSINESS 0.0040
INCOMPREHENSIBLE 53-YEAR-OLD WAR-LIKE SPECIALLY UNWILLINGNESS 0.025
Table 11. Most similar words using word-level embeddings learned using unlabeled English texts.
INCONSIDERABLE 00-YEAR-OLD SHEEP-LIKE DOMESTICALLY UNSTEADINESS 0.0000
------------------------------------------------------------------------------
INSIGNIFICANT SEVENTEEN-YEAR-OLD BURROWER WORLDWIDE PARESTHESIA 0.00000
INORDINATE SIXTEEN-YEAR-OLD CRUSTACEAN-LIKE 000,000,000 HYPERSALIVATION 0.000
ASSUREDLY FOURTEEN-YEAR-OLD TROLL-LIKE 00,000,000 DROWSINESS 0.000000
UNDESERVED NINETEEN-YEAR-OLD SCORPION-LIKE SALES DIPLOPIA +/-
SCRUPLE FIFTEEN-YEAR-OLD UROHIDROSIS RETAILS BREATHLESSNESS -0.00
---
Table 6. Most similar words using character-level embeddings learned with Mac-Morpho Corpus.
GRADACOES CLANDESTINAMENTE REVOGACAO DESLUMBRAMENTO DROGASSE
------------------------------------------------------------------------------
TRADICOES GLOBALMENTE RENOVACAO DESPOJAMENTO DIVULGASSE
TRADUCOES CRONOMETRICAMENTE REAVALIACAO DESMANTELAMENTO DIRIGISSE
ADAPTACOES CONDICIONALMENTE REVITALIZACAO DESREGRAMENTO JOGASSE
INTRADUCOES APARENTEMENTE REFIGURACAO DESENRAIZAMENTO PROCURASSE
REDACOES FINANCEIRAMENTE RECOLOCACAO DESMATAMENTO JULGASSE
------------------------------------------------------------------------------
At word level: similar to GRADACOES -> TONALIDADES MODULACOES CARACTERIZACOES NUANCAS COLORACOES; CLANDESTINAMENTE --> ILEGALMENTE ALI ATAMBUA BRAZZAVILLE VOLUNTARIAMENTE
Abstract Distributed word representations have recently been proven to be an invaluable resource for NLP. These representations are normally learned using neural networks and capture syntactic and semantic information about words. Information about word morphology and shape is normally ignored when learning word representations. However, for tasks like part-of-speech tagging, intra-word information is extremely useful, especially when dealing with morphologically rich languages. *** In this paper, we propose a deep neural network that learns character-level representations of words and associates them with usual word representations to perform POS tagging. Using the proposed approach, while avoiding the use of any handcrafted feature, we produce state-of-the-art POS taggers for two languages: English, with 97.32% accuracy on the Penn Treebank WSJ corpus; and Portuguese, with 97.47% accuracy on the Mac-Morpho corpus, where the latter represents an error reduction of 12.2% on the best previous known result.
}} %%%%%%%% @inproceedings{socher-bauer-ng-manning-13_parsing-with-vector-grammars, title={Parsing with compositional vector grammars}, author={Socher, Richard and Bauer, John and Manning, Christopher D and Ng, Andrew Y}, booktitle={Proceedings of the ACL Conference}, year={2013}, annote = { Abstract Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and, implemented approximately as an efficient reranker, it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments. }} @inproceedings{attardi-15_deepNL-deep-learning-nlp-pipeline, title={DeepNL: a Deep Learning NLP pipeline}, author={Attardi, Giuseppe}, booktitle={Proceedings of NAACL-HLT}, pages={109--115}, year={2015}, annote = { Abstract We present the architecture of a deep learning pipeline for natural language processing. Based on this architecture we built a set of tools both for creating distributional vector representations and for performing specific NLP tasks. Three methods are available for creating embeddings: feedforward neural network, sentiment specific embeddings and embeddings based on counts and Hellinger PCA. Two methods are provided for training a network to perform sequence tagging, a window approach and a convolutional approach. The window approach is used for implementing a POS tagger and a NER tagger, the convolutional network is used for Semantic Role Labeling. The library is implemented in Python with core numerical processing written in C++ using a parallel linear algebra library for efficiency and scalability. }} @article{tammewar-singla-15_leveraging-lexical-information-in-SMT, title={Methods for Leveraging Lexical Information in SMT}, author={Aniruddha Tammewar and Karan Singla and Bhasha Agrawal and Riyaz Bhat and Dipti Misra Sharma}, year={2015}, journal = {Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2015)}, pages = {21--30}, annote = { Abstract Word embeddings have been shown to be useful in a wide range of NLP tasks. We explore methods of using the embeddings in dependency parsing of Hindi, a MoR-FWO (morphologically rich, relatively freer word order) language, and show that they not only help improve the quality of parsing, but can even act as a cheap alternative to the traditional features which are costly to acquire. We demonstrate that if we use distributed representations of lexical items instead of features produced by costly tools such as a morphological analyzer, we get competitive results. This implies that a monolingual corpus alone will suffice to produce good accuracy for resource-poor languages for which these tools are unavailable. We also explored the importance of these representations for domain adaptation.
--- We used MaltParser (version 1.8) (Nivre et al., 2007) for training the parser and Word2Vec for vector space modeling. Chen and Manning (2014) have recently proposed a way of learning a neural network classifier for use in a greedy, transition-based dependency parser. They used a small number of dense features, unlike most of the current parsers which use millions of sparse indicator features. The small number of features makes it work very fast. The use of low-dimensional dense word embeddings as features in place of sparse features has recently been adopted in many NLP tasks and has successfully shown improvements in terms of both time and accuracy. Recent works include POS tagging (Collobert et al., 2011), Machine Translation (Devlin et al., 2014) and Constituency Parsing (Socher et al., 2013). These dense, continuous word embeddings allow words to be used more effectively in statistical approaches. The problem of data sparsity is reduced, and similarity between words can now be calculated using vectors. }} @article{bisk-hockenmaier-15_probing-unsupervised-grammar-induction, title={Probing the Linguistic Strengths and Limitations of Unsupervised Grammar Induction}, author={Bisk, Yonatan and Hockenmaier, Julia}, journal={Proceedings of ACL}, year={2015}, annote = { Abstract Work in grammar induction should help shed light on the amount of syntactic structure that is discoverable from raw word or tag sequences. But since most current grammar induction algorithms produce unlabeled dependencies, it is difficult to analyze what types of constructions these algorithms can or cannot capture, and, therefore, to identify where additional supervision may be necessary. This paper provides an in-depth analysis of the errors made by unsupervised CCG parsers by evaluating them against the labeled dependencies in CCGbank, hinting at new research directions necessary for progress in grammar induction. }} @inproceedings{dong-wei-15_question-answering-over-freebase-multi-column-CNN, title={Question Answering over Freebase with Multi-Column Convolutional Neural Networks}, author={Dong, Li and Wei, Furu and Zhou, Ming and Xu, Ke}, booktitle={Proceedings of ACL}, year={2015}, annote = { ... we introduce the multi-column convolutional neural networks (MCCNNs) to automatically analyze questions from multiple aspects. Specifically, the model shares the same word embeddings to represent question words. MCCNNs use different column networks to extract answer types, relations, and context information from the input questions. The entities and relations in the knowledge base (namely FREEBASE in our experiments) are also represented as low-dimensional vectors. Then, a score layer is employed to rank candidate answers according to the representations of questions and candidate answers. The proposed information extraction based method utilizes question-answer pairs to automatically learn the model without relying on manually annotated logical forms and hand-crafted features. We also do not use any pre-defined lexical triggers and rules. In addition, question paraphrases are also used to train the networks and generalize for unseen words in a multi-task learning manner. We have conducted extensive experiments on WEBQUESTIONS. Experimental results illustrate that our method outperforms several baseline systems. ... The models share the same word embeddings, and have multiple columns of convolutional neural networks. The number of columns is set to three in our QA task.
These columns are used to analyze different aspects of a question, i.e., answer path, answer context, and answer type. The vector representations learned by these columns are denoted as f1(q), f2(q), f3(q). We also learn embeddings for the candidate answers that appear in FREEBASE. Abstract Answering natural language questions over a knowledge base is an important and challenging task. Most existing systems typically rely on hand-crafted features and rules to conduct question understanding and/or answer ranking. In this paper, we introduce multi-column convolutional neural networks (MCCNNs) to understand questions from three different aspects (namely, answer path, answer context, and answer type) and learn their distributed representations. Meanwhile, we jointly learn low-dimensional embeddings of entities and relations in the knowledge base. Question-answer pairs are used to train the model to rank candidate answers. We also leverage question paraphrases to train the column networks in a multi-task learning manner. We use Freebase as the knowledge base and conduct extensive experiments on the WebQuestions dataset. Experimental results show that our method achieves better or comparable performance compared with baseline systems. In addition, we develop a method to compute the salience scores of question words in different column networks. The results help us intuitively understand what MCCNNs learn. }} @article{liu-jiang-15_learning-semantic-word-embeddings-from-ordinal-knowledge, title={Learning semantic word embeddings based on ordinal knowledge constraints}, author={Liu, Quan and Jiang, Hui and Wei, Si and Ling, Zhen-Hua and Hu, Yu}, journal={Proceedings of ACL, Beijing, China}, year={2015}, annote = { We represent semantic knowledge as many ordinal ranking inequalities and formulate the learning of semantic word embeddings (SWE) as a constrained optimization problem, where the data-derived objective function is optimized subject to all ordinal knowledge inequality constraints extracted from available knowledge resources such as thesauri and WordNet. We have demonstrated that this constrained optimization problem can be efficiently solved by the stochastic gradient descent (SGD) algorithm, even for a large number of inequality constraints. Experimental results on four standard NLP tasks, including word similarity measure, sentence completion, named entity recognition, and TOEFL synonym selection, have all demonstrated that the quality of learned word vectors can be significantly improved after semantic knowledge is incorporated as inequality constraints during the learning process of word embeddings. }} @inproceedings{chen-lin-15_revisiting-word-embedding_contrasting-meaning, title={Revisiting word embedding for contrasting meaning}, author={Chen, Zhigang and Lin, Wei and Chen, Qian and Chen, Xiaoping and Wei, Si and Zhu, Xiaodan and Jiang, Hui}, booktitle={Proceedings of ACL}, year={2015}, annote = { }} @inproceedings{guo-wang-15_semantically-smooth-knowledge-graph-embedding, title={Semantically Smooth Knowledge Graph Embedding}, author={Guo, Shu and Wang, Quan and Wang, Bin and Wang, Lihong and Guo, Li}, booktitle={Proceedings of ACL}, year={2015}, annote = { This paper considers the problem of embedding Knowledge Graphs (KGs) consisting of entities and relations into low-dimensional vector spaces. Most of the existing methods perform this task based solely on observed facts.
The only requirement is that the learned embeddings should be compatible within each individual fact. In this paper, aiming at further discovering the intrinsic geometric structure of the embedding space, we propose Semantically Smooth Embedding (SSE). The key idea of SSE is to take full advantage of additional semantic information and enforce the embedding space to be semantically smooth, i.e., entities belonging to the same semantic category will lie close to each other in the embedding space. Two manifold learning algorithms, Laplacian Eigenmaps and Locally Linear Embedding, are used to model the smoothness assumption. Both are formulated as geometrically based regularization terms to constrain the embedding task. We empirically evaluate SSE in two benchmark tasks of link prediction and triple classification, and achieve significant and consistent improvements over state-of-the-art methods. Furthermore, SSE is a general framework. The smoothness assumption can be imposed on a wide variety of embedding models, and it can also be constructed using other information besides entities' semantic categories. }} @inproceedings{iacobacci-pilehvar-15_sensembed-embeddings-for-word-relations, title={SensEmbed: Learning sense embeddings for word and relational similarity}, author={Iacobacci, Ignacio and Pilehvar, Mohammad Taher and Navigli, Roberto}, booktitle={Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China}, year={2015}, annote = { Word embeddings have recently gained considerable popularity for modeling words in different Natural Language Processing (NLP) tasks including semantic similarity measurement. However, notwithstanding their success, word embeddings are by their very nature unable to capture polysemy, as different meanings of a word are conflated into a single representation. In addition, their learning process usually relies on massive corpora only, preventing them from taking advantage of structured knowledge. We address both issues by proposing a multi-faceted approach that transforms word embeddings to the sense level and leverages knowledge from a large semantic network for effective semantic similarity measurement. We evaluate our approach on word similarity and relational similarity frameworks, reporting state-of-the-art performance on multiple datasets. }} @article{wang-15_sentiment-aspect-w-restricted-boltzmann-machines, title={Sentiment-Aspect Extraction based on Restricted Boltzmann Machines}, author={Wang, Linlin and Liu, Kang and Cao, Zhu and Zhao, Jun and de Melo, Gerard}, journal={Proceedings of ACL}, year={2015}, annote = { Regarding aspect identification, previous methods can be divided into three main categories: rule-based, supervised, and topic model-based methods. For instance, association rule-based methods (Hu and Liu, 2004; Liu et al., 1998) tend to focus on extracting product feature words and opinion words but neglect connecting product features at the aspect level. Existing rule-based methods typically are not able to group the extracted aspect terms into categories. Supervised (Jin et al., 2009; Choi and Cardie, 2010) and semi-supervised learning methods (Zagibalov and Carroll, 2008; Mukherjee and Liu, 2012) were introduced to resolve certain aspect identification problems. However, supervised training requires hand-labeled training data and has trouble coping with domain adaptation scenarios. Hence, unsupervised methods are often adopted to avoid this sort of dependency on labeled data.
Latent Dirichlet Allocation, or LDA for short (Blei et al., 2003), performs well in automatically extracting aspects and grouping corresponding representative words into categories. Thus, a number of LDA-based aspect identification approaches have been proposed in recent years (Brody and Elhadad, 2010; Titov and McDonald, 2008; Zhao et al., 2010). Still, these methods have several important drawbacks. First, inaccurate approximations of the distribution over topics may reduce the computational accuracy. Second, mixture models are unable to exploit the co-occurrence of topics to yield high probability predictions for words that are sharper than the distributions predicted by individual topics (Hinton and Salakhutdinov, 2009). To overcome the weaknesses of existing methods and pursue the promising direction of jointly learning aspect and sentiment, we present the novel Sentiment-Aspect Extraction RBM (SERBM) model to simultaneously extract aspects of entities and relevant sentiment-bearing words. This two-layer structure model is inspired by conventional Restricted Boltzmann machines (RBMs). In previous work, RBMs with shared parameters (RSMs) have achieved great success in capturing distributed semantic representations from text (Hinton and Salakhutdinov, 2009). Abstract Aspect extraction and sentiment analysis of reviews are both important tasks in opinion mining. We propose a novel sentiment and aspect extraction model based on Restricted Boltzmann Machines to jointly address these two tasks in an unsupervised setting. This model reflects the generation process of reviews by introducing a heterogeneous structure into the hidden layer and incorporating informative priors. Experiments show that our model outperforms previous state-of-the-art methods. }} @inproceedings{kennington-schlangen-15_compositional_perceptually-grounded-word-learning, title={Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution}, author={Kennington, Casey and Schlangen, David}, booktitle={Proceedings of the Conference for the Association for Computational Linguistics (ACL)}, year={2015}, annote = { }} @article{peng-chang-roth-15-CONLL_joint-coreference-and-mention-head-detection, title={A Joint Framework for Coreference Resolution and Mention Head Detection}, author={Peng, Haoruo and Chang, Kai-Wei and Roth, Dan}, journal={CoNLL 2015}, volume={51}, pages={12}, year={2015}, annote = { "mention heads": “the incumbent [Barack Obama]” --> “Barack Obama”; “[officials] at the Pentagon” --> “officials”. Mention heads can be used as auxiliary structures for coreference. .... The problem of identifying mention heads is a sequential phrase identification problem, and we choose to employ the BILOU representation as it has advantages over the traditional BIO representation, as shown, e.g., in Ratinov and Roth (2009). The BILOU representation suggests learning classifiers that identify the Beginning, Inside and Last tokens of multi-token chunks as well as Unit-length chunks. The problem is then transformed into a simple, but constrained, 5-class classification problem.
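(A minimal sketch of the BILOU encoding just described; the helper and its span format are hypothetical, not from the paper.)

  def to_bilou(tokens, chunks):
      # chunks: list of (start, end) token spans, end exclusive.
      # Unit-length chunks get U; longer chunks get B ... I ... L; all else O.
      tags = ["O"] * len(tokens)
      for start, end in chunks:
          if end - start == 1:
              tags[start] = "U"
          else:
              tags[start] = "B"
              for i in range(start + 1, end - 1):
                  tags[i] = "I"
              tags[end - 1] = "L"
      return tags

  # e.g. to_bilou("the incumbent Barack Obama".split(), [(2, 4)]) -> ['O', 'O', 'B', 'L']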
}} @article{ishiwatari-kaji-15-CONLL_accurate-cross-lingual-projection_word-vectors, title={Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs}, author={Ishiwatari, Shonosuke and Kaji, Nobuhiro and Yoshinaga, Naoki and Toyoda, Masashi and Kitsuregawa, Masaru}, journal={CoNLL 2015}, pages={300}, year={2015}, annote = { **** Abstract We propose a method that learns a crosslingual projection of word representations from one language into another. Our method utilizes translatable context pairs as bonus terms of the objective function. In the experiments, our method outperformed existing methods in three language pairs, (English, Spanish), (Japanese, Chinese) and (English, Japanese), without using any additional supervisions. }} @article{qu-L-ferraro-baldwin-15-CONLL_big-data-domain-word-repr_sequence-labelling, title={Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representation on Sequence Labelling Tasks}, author={Qu, Lizhen and Ferraro, Gabriela and Zhou, Liyuan and Hou, Weiwei and Schneider, Nathan and Baldwin, Timothy}, journal={CoNLL 2015}, year={2015}, annote = { Abstract Word embeddings — distributed word representations that can be learned from unlabelled data — have been shown to have high utility in many natural language processing applications. In this paper, we perform an extrinsic evaluation of four popular word embedding methods in the context of four sequence labelling tasks: part-of-speech tagging, syntactic chunking, named entity recognition, and multiword expression identification. A particular focus of the paper is analysing the effects of task-based updating of word representations. We show that when using word embeddings as features, as few as several hundred training instances are sufficient to achieve competitive results, and that word embeddings lead to improvements over out-of-vocabulary words and also out of domain. Perhaps more surprisingly, our results indicate there is little difference between the different word embedding methods, and that simple Brown clusters are often competitive with word embeddings across all tasks we consider. Possible project: POS-Tagging for Hindi w word vector models }} @inproceedings{johannsen-hovy-15-CONLL_cross-lingual-syntactic-variation_age-gender, title={Cross-lingual syntactic variation over age and gender}, author={Johannsen, Anders and Hovy, Dirk and S{\o}gaard, Anders}, booktitle={Proceedings of CoNLL}, year={2015}, annote = { Abstract Most computational sociolinguistics studies have focused on phonological and lexical variation. We present the first large-scale study of syntactic variation among demographic groups (age and gender) across several languages. We harvest data from online user-review sites and parse it with universal dependencies. We show that several age and gender-specific variations hold across languages, for example that women are more likely to use VP conjunctions. }} @article{duong-cohn-15-CONLL_cross-lingual-transfer-unsupervised-dependency-parsing, title={Cross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data}, author={Duong, Long and Cohn, Trevor and Bird, Steven and Cook, Paul}, journal={CoNLL 2015}, pages={113}, year={2015}, annote = { Abstract Cross-lingual transfer has been shown to produce good results for dependency parsing of resource-poor languages. Although this avoids the need for a target language treebank, most approaches have still used large parallel corpora. 
However, parallel data is scarce for low-resource languages, and we report a new method that does not need parallel data. Our method learns syntactic word embeddings that generalise over the syntactic contexts of a bilingual vocabulary, and incorporates these into a neural network parser. We show empirical improvements over a baseline delexicalised parser on both the CoNLL and Universal Dependency Treebank datasets. We analyse the importance of the source languages, and show that combining multiple source-languages leads to a substantial improvement. }} @article{luong-kayser-15-CONLL_deep-neural-language-models-for-translation, title={Deep Neural Language Models for Machine Translation}, author={Luong, Minh-Thang and Kayser, Michael and Manning, Christopher D}, journal={CoNLL 2015}, pages={305}, year={2015}, annote = { Abstract Neural language models (NLMs) have been able to improve machine translation (MT) thanks to their ability to generalize well to long contexts. Despite recent successes of deep neural networks in speech and vision, the general practice in MT is to incorporate NLMs with only one or two hidden layers and there have not been clear results on whether having more layers helps. In this paper, we demonstrate that deep NLMs with three or four layers outperform those with fewer layers in terms of both the perplexity and the translation quality. We combine various techniques to successfully train deep NLMs that jointly condition on both the source and target contexts. When reranking n-best lists of a strong web-forum baseline, our deep models yield an average boost of 0.5 TER / 0.5 BLEU points compared to using a shallow NLM. Additionally, we adapt our models to a new SMS-chat domain and obtain a similar gain of 1.0 TER / 0.5 BLEU points. }} @article{bogdanova-dos-15-CONLL_detecting-semantically-equivalent-questions, title={Detecting Semantically Equivalent Questions in Online User Forums}, author={Bogdanova, Dasha and dos Santos, C{\'\i}cero and Barbosa, Luciano and Zadrozny, Bianca}, journal={CoNLL 2015}, pages={123}, year={2015}, annote = { Abstract Two questions asking the same thing could be too different in terms of vocabulary and syntactic structure, which makes identifying their semantic equivalence challenging. This study aims to detect semantically equivalent questions in online user forums. We perform an extensive number of experiments using data from two different Stack Exchange forums. We compare standard machine learning methods such as Support Vector Machines (SVM) with a convolutional neural network (CNN). The proposed CNN generates distributed vector representations for pairs of questions and scores them using a similarity metric. We evaluate in-domain word embeddings versus the ones trained with Wikipedia, estimate the impact of the training set size, and evaluate some aspects of domain adaptation. Our experimental results show that the convolutional neural network with in-domain word embeddings achieves high performance even with limited training data. }} @article{mihaylov-georgiev-15-CONLL_finding-opinion-manipulation-trolls-in-forums, title={Finding Opinion Manipulation Trolls in News Community Forums}, author={Mihaylov, Todor and Georgiev, Georgi D and Nakov, Preslav}, journal={CoNLL 2015}, pages={310}, year={2015}, annote = { Trolls: Humans inserted into a social forum to create discord... We considered as trolls users who were called such by at least n distinct users, and non-trolls if they have never been called so.
Requiring that a user should have at least 100 comments in order to be interesting for our experiments left us with 317 trolls and 964 non-trolls. Here are two examples (translated): “To comment from ”Rozalina”: You, trolls, are so funny :) I saw the same signature under other comments:)” “To comment from ”Historama”: Murzi, you know that you cannot manipulate public opinion, right?” [FN: Murzi is short for murzilka. According to a series of recent publications in Bulgarian media, Russian Internet users reportedly use the term murzilka to refer to Internet trolls.] Abstract The emergence of user forums in electronic news media has given rise to the proliferation of opinion manipulation trolls. Finding such trolls automatically is a hard task, as there is no easy way to recognize or even to define what they are; this also makes it hard to get training and testing data. We solve this issue pragmatically: we assume that a user who is called a troll by several people is likely to be one. We experiment with different variations of this definition, and in each case we show that we can train a classifier to distinguish a likely troll from a non-troll with very high accuracy, 82–95%, thanks to our rich feature set. }} @article{klinger-cimiano-15-CONLL_instance-selection_for_cross-lingual-sentiment, title={Instance Selection Improves Cross-Lingual Model Training for Fine-Grained Sentiment Analysis}, author={Klinger, Roman and Cimiano, Philipp}, journal={CoNLL 2015}, pages={153}, year={2015}, annote = { Abstract Scarcity of annotated corpora for many languages is a bottleneck for training fine-grained sentiment analysis models that can tag aspects and subjective phrases. We propose to exploit statistical machine translation to alleviate the need for training data by projecting annotated data in a source language to a target language such that a supervised fine-grained sentiment analysis system can be trained. To avoid a negative influence of poor-quality translations, we propose a filtering approach based on machine translation quality estimation measures to select only high-quality sentence pairs for projection. We evaluate on the language pair German/English on a corpus of product reviews annotated for both languages and compare to in-target-language training. Projection without any filtering leads to 23% F1 in the task of detecting aspect phrases, compared to 41% F1 for in-target-language training. Our approach obtains up to 47% F1. Further, we show that the detection of subjective phrases is competitive to in-target-language training without filtering. }} @article{cotterell-muller-15-CONLL_labeled-morphological-segmentation-HMM, title={Labeled Morphological Segmentation with Semi-Markov Models}, author={Cotterell, Ryan and M{\"u}ller, Thomas and Fraser, Alexander and Sch{\"u}tze, Hinrich}, journal={CoNLL 2015}, pages={164}, year={2015}, annote = { Abstract We present labeled morphological segmentation—an alternative view of morphological processing that unifies several tasks. We introduce a new hierarchy of morphotactic tagsets and CHIPMUNK, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. For morphological segmentation our method shows absolute improvements of 2-6 points F1 over a strong baseline.
}} @article{maillard-clark-15-CONLL_learning-adjectives-w-tensor-skip-gram, title={Learning Adjective Meanings with a Tensor-Based Skip-Gram Model}, author={Maillard, Jean and Clark, Stephen}, journal={CoNLL 2015}, pages={327}, year={2015}, annote = { *** combining compositional and distributional semantics. A proposal for the case of adjective-noun combinations is given by Baroni and Zamparelli (2010) (and also Guevara (2010)). Their model represents adjectives as matrices over noun space, trained via linear regression to approximate the “holistic” adjective-noun vectors from the corpus. In this paper we propose a new solution to the problem of learning adjective meaning representations. The model is an implementation of the tensor framework of Coecke et al. (2011), here applied to adjective-noun combinations as a starting point. 2 A tensor-based skip-gram model Our model treats adjectives as linear maps over the vector space of noun meanings, encoded as matrices. The algorithm works in two stages: the first stage learns the noun vectors, as in a standard skip-gram model, and the second stage learns the adjective matrices, given fixed noun vectors. }} @article{yin-W-schuetze-14-ACL_exploration-of-embeddings-for-generalized-phrases, title={An Exploration of Embeddings for Generalized Phrases}, author={Yin, Wenpeng and Sch{\"u}tze, Hinrich}, journal={ACL 2014}, pages={41}, year={2014}, annote = { }} @article{yin-W-schuetze-15-CONLL_multichannel-convolution-for-sentence-vector, title={Multichannel Variable-Size Convolution for Sentence Classification}, author={Yin, Wenpeng and Sch{\"u}tze, Hinrich}, journal={CoNLL 2015}, year={2015}, annote = { Word embeddings are derived by projecting words from a sparse, 1-of-V encoding (V : vocabulary size) onto a lower dimensional and dense vector space via hidden layers and can be interpreted as feature extractors that encode semantic and syntactic features of words. Many papers study the comparative performance of different versions of word embeddings, usually learned by different neural network (NN) architectures. We want to leverage this diversity of different embedding versions to extract higher quality sentence features and thereby improve sentence classification performance. The letters “M” and “V” in the name “MVCNN” of our architecture denote the multichannel and variable-size convolution filters, respectively. “Multichannel” employs language from computer vision where a color image has red, green and blue channels. Here, a channel is a description by an embedding version. In sum, we attribute the success of MVCNN to: (i) designing variable-size convolution filters to extract variable-range features of sentences and (ii) exploring the combination of multiple public embedding versions to initialize words in sentences. We also employ two “tricks” to further enhance system performance: mutual learning and pretraining. Multichannel Input. The input of MVCNN includes multichannel feature maps of a considered sentence, each of which is a matrix initialized by a different embedding version. Let s be the sentence length, d the dimension of the word embeddings, and c the total number of different embedding versions (i.e., channels). Hence, the whole initialized input is a three-dimensional array of size c × d × s. Figure 1 depicts a sentence with s = 12 words. Each word is initialized by c = 5 embeddings, each coming from a different channel.
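(A minimal sketch of assembling the c × d × s multichannel input just described; an assumed helper, not the authors' code. Each embedding version is modeled as a plain word-to-vector dict.)

  import numpy as np

  def multichannel_input(sentence, embedding_versions, d):
      # embedding_versions: list of c dicts, each mapping a word to a d-dim vector.
      # Returns a (c, d, s) array; a word missing from some version is randomly
      # initialized, mirroring the unknown-word handling mentioned in the paper.
      words = sentence.split()
      s, c = len(words), len(embedding_versions)
      x = np.empty((c, d, s))
      for i, emb in enumerate(embedding_versions):
          for j, w in enumerate(words):
              x[i, :, j] = emb.get(w, np.random.randn(d) * 0.01)
      return x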
In implementation, sentences in a mini-batch will be padded to the same length, and unknown words for corresponding channel are randomly initialized or can acquire good initialization from the mutual-learning phase described in next section. }} @article{weegar-astrom-15-CONLL_linking-entities-across-images-and-text, title={Linking Entities Across Images and Text}, author={Weegar, Rebecka and Astr{\"o}m, Kalle and Nugues, Pierre}, journal={CoNLL 2015}, pages={185}, year={2015}, annote = { *** get IAPR TC-12 dataset Figure 1 [image of clouds and grass etc. shows an example of an image from the Segmented and Annotated IAPR TC-12 data set (Escalantea et al., 2010). It has four regions labeled cloud, grass, hill, and river, and the caption: a flat landscape with a dry meadow in the foreground, a lagoon behind it and many clouds in the sky Karpathy et al. (2014) proposed a system for retrieving related images and sentences. They used neural networks and they show that the results are improved if image objects and sentence fragments are included in the model. Sentence fragments are extracted from dependency graphs, where each edge in the graphs corresponds to a fragment. Abstract This paper describes a set of methods to link entities across images and text. As a corpus, we used a data set of images, where each image is commented by a short caption and where the regions in the images are manually segmented and labeled with a category. We extracted the entity mentions from the captions and we computed a semantic similarity between the mentions and the region labels. We also measured the statistical associations between these mentions and the labels and we combined them with the semantic similarity to produce mappings in the form of pairs consisting of a region label and a caption entity. In a second step, we used the syntactic relationships between the mentions and the spatial relationships between the regions to rerank the lists of candidate mappings. To evaluate our methods, we annotated a test set of 200 images, where we manually linked the image regions to their corresponding mentions in the captions. Eventually, we could match objects in pictures to their correct mentions for nearly 89 percent of the segments, when such a matching exists. }} @article{chatterji-sarkar-basu-14_dependency-annotation_bangla-treebank, title={A dependency annotation scheme for Bangla treebank}, author={Chatterji, Sanjay and Sarkar, Tanaya Mukherjee and Dhang, Pragati and Deb, Samhita and Sarkar, Sudeshna and Chakraborty, Jayshree and Basu, Anupam}, journal={Language Resources and Evaluation}, volume={48}, number={3}, pages={443--477}, year={2014}, publisher={Springer}, annote = { }} @inproceedings{al-Badrashiny-eskander-14_automatic-transliteration_romanized-arabic, title={Automatic transliteration of romanized dialectal arabic}, author={Al-Badrashiny, Mohamed and Eskander, Ramy and Habash, Nizar and Rambow, Owen}, booktitle={Proceedings of the Eighteenth Conference on Computational Natural Language Learning}, pages={30--38}, year={2014}, annote = { }} @article{levy-goldberg-14_linguistic-regularities-in-sparse-explicit-word-reprs, title={Linguistic regularities in sparse and explicit word representations}, author={Levy, Omer and Goldberg, Yoav and Ramat-Gan, Israel}, journal={CoNLL-2014}, pages={171}, year={2014}, annote = { Recent work has shown that neural-embedded word representations capture many relational similarities, which can be recovered by means of vector arithmetic in the embedded space. 
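(A minimal sketch of the vector-arithmetic analogy recovery discussed in this annotation, i.e., the additive objective analyzed in the paper; vecs and vocab are assumed inputs with unit-normalized rows.)

  import numpy as np

  def analogy(a, a_star, b, vecs, vocab):
      # Solve "a is to a_star as b is to ?" by maximizing cos(b*, a_star - a + b),
      # excluding the three query words themselves.
      idx = {w: i for i, w in enumerate(vocab)}
      target = vecs[idx[a_star]] - vecs[idx[a]] + vecs[idx[b]]
      scores = vecs @ (target / np.linalg.norm(target))
      for w in (a, a_star, b):
          scores[idx[w]] = -np.inf
      return vocab[int(np.argmax(scores))]

  # e.g. analogy("man", "king", "woman", vecs, vocab) would ideally return "queen".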
We show that Mikolov et al.’s method of first adding and subtracting word vectors, and then searching for a word similar to the result, is equivalent to searching for a word that maximizes a linear combination of three pairwise word similarities. Based on this observation, we suggest an improved method of recovering relational similarities, improving the state-of-the-art results on two recent word-analogy datasets. Moreover, we demonstrate that analogy recovery is not restricted to neural word embeddings, and that a similar amount of relational similarities can be recovered from traditional distributional word representations related work with datasets and code: https://levyomer.wordpress.com/2015/06/23/learning-to-exploit-structured-resources-for-lexical-inference/ }} @inproceedings{kulkarni-alRfou-perozzi-15_statistically-significant-linguistic-change, title={Statistically Significant Detection of Linguistic Change}, author={Kulkarni, Vivek and Al-Rfou, Rami and Perozzi, Bryan and Skiena, Steven}, booktitle={Proceedings of the 24th International Conference on World Wide Web}, pages={625--635}, year={2015}, organization={International World Wide Web Conferences Steering Committee}, annote = { We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time. We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium. }} @article{alRfou-kulkarni-14_polyglot-NER-massive-multilingual-named-entity-recognition, title={POLYGLOT-NER: Massive Multilingual Named Entity Recognition}, author={Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven}, journal={SIAM Intl Conf Data Mining SDM 2015}, year={2014}, annote = { also: arXiv preprint arXiv:1410.3791 We build a Named Entity Recognition system (NER) for 40 languages using only language agnostic methods. Our system relies only on un-supervised methods for feature generation. We obtain training data for the task of NER through a semi-supervised technique not relying whatsoever on any language specific or orthographic features. 
This approach allows us to scale to large set of languages for which little human expertise and human annotated training data is available }} @incollection{perozzi-alRfou-14_inducing-language-networks-from-continuous-word-reprs, title={Inducing language networks from continuous space word representations}, author={Perozzi, Bryan and Al-Rfou, Rami and Kulkarni, Vivek and Skiena, Steven}, booktitle={Complex Networks V}, pages={261--273}, year={2014}, publisher={Springer}, annote = { We induced networks on continuous space representations of words over the Polyglot and Skipgram models. We compared the structural properties of these networks and demonstrate that these networks differ from networks constructed through other run of the mill methods. We also demonstrated that these networks exhibit a rich and varied community structure. }} @article{xiao-guo-15-CONLL_annotation-projection-for-cross-lingual-parsing, title={Annotation Projection-based Representation Learning for Cross-lingual Dependency Parsing}, author={Xiao, Min and Guo, Yuhong}, journal={CoNLL 2015}, pages={73}, year={2015}, annote = { Abstract Cross-lingual dependency parsing aims to train a dependency parser for an annotation-scarce target language by exploiting annotated training data from an annotation-rich source language, which is of great importance in the field of natural language processing. In this paper, we propose to address cross-lingual dependency parsing by inducing latent crosslingual data representations via matrix completion and annotation projections on a large amount of unlabeled parallel sentences. To evaluate the proposed learning technique, we conduct experiments on a set of cross-lingual dependency parsing tasks with nine different languages. The experimental results demonstrate the efficacy of the proposed learning method for cross-lingual dependency parsing. }} @article{bhatt-semwal-15-CONLL_iterative-similarity_cross-domain-text-classification, title={An Iterative Similarity based Adaptation Technique for Cross Domain Text Classification}, author={Bhatt, Himanshu S and Semwal, Deepali and Roy, Shourya}, journal={CoNLL 2015}, pages={52}, year={2015}, annote = { }} @article{lim-T-soon-LK-14_lexicon+-tx_rapid-multilingual-lexicon-for-under-resourced-lgs, title={Lexicon+ TX: rapid construction of a multilingual lexicon with under-resourced languages}, author={Lim, Lian Tze and Soon, Lay-Ki and Lim, Tek Yong and Tang, Enya Kong and Ranaivo-Malan{\c{c}}on, Bali}, journal={Language Resources and Evaluation}, volume={48}, number={3}, pages={479--492}, year={2014}, publisher={Springer}, annote = { }} @inproceedings{bhatt-narasimhan-misra-09_multi-representational-treebank-for-hindi-urdu, title={A multi-representational and multi-layered treebank for hindi/urdu}, author={Bhatt, Rajesh and Narasimhan, Bhuvana and Palmer, Martha and Rambow, Owen and Sharma, Dipti Misra and Xia, Fei}, booktitle={Proceedings of the Third Linguistic Annotation Workshop}, pages={186--189}, year={2009}, organization={Association for Computational Linguistics}, annote = { This paper describes the simultaneous development of dependency structure and phrase structure treebanks for Hindi and Urdu, as well as a PropBank. The dependency structure and the PropBank are manually annotated, and then the phrase structure treebank is produced automatically. To ensure successful conversion the development of the guidelines for all three representations are carefully coordinated. 
}} @article{mikolov-le-Q-13_similarities-in-lg-for-machine-translation, title={Exploiting similarities among languages for machine translation}, author={Mikolov, Tomas and Le, Quoc V and Sutskever, Ilya}, journal={arXiv preprint arXiv:1309.4168}, year={2013}, annote = { }}