Homework 3 - Paper Review
Update: Discussion groups
In this homework / project preparation, you should read a paper from
an NLP conference and prepare a short PPT presentation. The presentation
should be made within your group, discussed by everyone, and then
improved based on the feedback.
One or two presentations from each group will be presented in class.
To select a paper from the list below, fill out this form, making sure
that no one before you has already selected the paper.
The PDFs of most of these papers are available in the directory
http://web.cse.iitk.ac.in/users/cs671/15_papers/
Due date: PPTs should be uploaded by August 27.
Presentations should be discussed by each group in class on August 28,
then improved and presented to the next class. Discussion groups (not
project groups!) will be formed by August 20 based on your paper selections.
List of papers
WORD EMBEDDING (WORD VECTOR LEARNING)
- santos-15_boosting-NER-w-neural-character-embeddings
- guo-che-14_revisiting-embedding-for-semi-supervised-learning
- liu-jiang-15_learning-semantic-word-embeddings-from-ordinal-knowledge
- chen-lin-15_revisiting-word-embedding_contrasting-meaning
- levy-goldberg-14_linguistic-regularities-in-sparse-explicit-word-reprs
- perozzi-alRfou-14_inducing-language-networks-from-continuous-word-reprs
- iacobacci-pilehvar-15_sensembed-embeddings-for-word-relations
- astudillo-amir-15_learning-word-repr-from-scarce-and-noisy-data
- levy-goldberg-14-NIPS_neural-word-embedding-as-matrix-factorization
- stratos-collins-15-ACL_word-embeddings-from-count-matrices
- gouws-sogaard-15-naacl_task-specific-bilingual-word-embeddings
VECTORS FOR PHRASES / SENTENCES
- kiros-salakhutdinov-14_multimodal-neural-language-models
- kalchbrenner-grefenstette-14_CNN-for-modelling-sentences
- yin-W-schuetze-15-CONLL_multichannel-convolution-for-sentence-vector
- le-P-zuidema-15_compositional-distributional-semanticsw-LSTM
WORD VECTOR POS-TAGGING
- fonseca-rosa-15_evaluating-word-embeddings_POS-tagging-in-portuguese
- yin-W-schuetze-14-ACL_exploration-of-embeddings-for-generalized-phrases
- lin-ammar-15_unsupervised-pos-induction-with-word-embeddings
- santos-zadrozny-14-icml_character-level-repre-for-POS-tagging-portuguese
WORD VECTOR FOR SYNTAX
- qu-L-ferraro-baldwin-15-CONLL_big-data-domain-word-repr_sequence-labelling
- maillard-clark-15-CONLL_learning-adjectives-w-tensor-skip-gram
- iyyer-manjunatha-15-deep-unordered-composition-rivals-syntax
- andreas-klein-14_much-word-embeddings-encode-syntax
- ling-dyer-15_two-simple-adaptations-of-word2v-ec-for-syntax
- bansal-gimpel-14_tailoring-word-vectors-for-dependency-parsing
- socher-bauer-ng-manning-13_parsing-with-vector-grammars
- attardi-15_deepNL-deep-learning-nlp-pipeline
- bisk-hockenmaier-15_probing-unsupervised-grammar-induction
- bride-cruys-asher-15-ACL_lexical-composition-in-distributional-semantics
SEMANTICS
- neelakantan-roth-15_compositional-vectors_knowledge-base-completion
- bing-li-P-15-ACL_abstractive-multi-document-summarization_phrase-merging
- purver-sadrzadeh-15_distributional-semantics-to-pragmatics
- miller-gurevych-15_automatic-disambiguation-of-english-puns
- zhou-xu-15_end-to-end-semantic-role-labeling-w-RNN
- wang-berant-15_building-semantic-parser-overnight
- mitchell-steedman-15_orthogonal-syntax-and-semantics-in-distributional-spaces
- camacho-pilehvar-15-ACL_unified-multilingual-semantics-of-concepts
- tackstrom-ganchev-das-15_efficient-structured-learning-for-semantic-role-labeling
- yang-aloimonos-aksoy-15-ACL_learning-semantics-of-manipulation
- quirk-mooney-15_language-to-code_semantic-parsers-for-if-else--recipes
- guo-wang-15_semantically-smooth-knowledge-graph-embedding
- dong-wei-15_question-answering-over-freebase-multi-column-CNN
- peng-chang-roth-15-CONLL_joint-coreference-and-mention-head-detection
- bogdanova-dos-15-CONLL_detecting-semantically-equivalent-questions
- mihaylov-georgiev-15-CONLL_finding-opinion-manipulation-trolls-in-forums
- kulkarni-alRfou-perozzi-15_statistically-significant-linguistic-change
- alRfou-kulkarni-14_polyglot-NER-massive-multilingual-named-entity-recognition
SENTIMENT
- wang-15_sentiment-aspect-w-restricted-boltzmann-machines
- zhou-chen-15_learning-bilingual-sentiment-word-embeddings
- klinger-cimiano-15-CONLL_instance-selection_for_cross-lingual-sentiment
- labutov-basu-15-ACL_deep-questions-without-deep-understanding
- persing-ng-V-15_modeling-argument-strength-in-student-essays
MORPHOLOGY
- ma-J-hinrichs-15-ACL_accurate-linear-time-chinese-word-segm-w-WVs
- cotterell-muller-15-CONLL_labeled-morphological-segmentation-HMM
CROSS-LANGUAGE MODELS
- deMarneffe-dozat-14_universal-dependencies_cross-linguistic
- ishiwatari-kaji-15-CONLL_accurate-cross-lingual-projection_word-vectors
- johannsen-hovy-15-CONLL_cross-lingual-syntactic-variation_age-gender
- duong-cohn-15-CONLL_cross-lingual-transfer-unsupervised-dependency-parsing
- xiao-guo-15-CONLL_annotation-projection-for-cross-lingual-parsing
HINDI / BANGLA
- chatterji-sarkar-basu-14_dependency-annotation_bangla-treebank
- senapati-garain-14_maximum-entropy-based-honorificity_bengali-anaphora
- bhatt-semwal-15-CONLL_iterative-similarity_cross-domain-text-classification
- lim-T-soon-LK-14_lexicon+-tx_rapid-multilingual-lexicon-for-under-resourced-lgs
- bhatt-narasimhan-misra-09_multi-representational-treebank-for-hindi-urdu
- tammewar-singla-15_leveraging-lexical-information-in-SMT
TRANSLATION
- luong-sutskever-le-15-ACL_rare-word-problem-in-neural-machine-translation
- luong-kayser-15-CONLL_deep-neural-language-models-for-translation
- al-Badrashiny-eskander-14_automatic-transliteration_romanized-arabic
- mikolov-le-Q-13_similarities-in-lg-for-machine-translation
MULTIMODAL (IMAGES + LANGUAGE)
- kiros-salakhutdinov-14_multimodal-neural-language-models
- li-chen-15-ACL_visualizing-and-understanding-neural-models-in-NLP
- shutova-tandon-15_perceptually-grounded-selectional-preferences
- kennington-schlangen-15_compositional_perceptually-grounded-word-learning
- elliott-de-15_describing-images-using-inferred-visual-dependency-representations
- weegar-astrom-15-CONLL_linking-entities-across-images-and-text
List of papers: BibTeX
@InProceedings{yang-aloimonos-aksoy-15-ACL_learning-semantics-of-manipulation, author = {Yang, Yezhou and Aloimonos, Yiannis and Fermuller, Cornelia and Aksoy, Eren Erdal}, title = {Learning the Semantics of Manipulation Action}, booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)}, month = {July}, year = {2015}, address = {Beijing, China}, publisher = {Association for Computational Linguistics}, pages = {676--686}, url = {http://www.aclweb.org/anthology/P15-1066} annote = { Abstract In this paper we present a formal computational framework for modeling manipulation actions. The introduced formalism leads to semantics of manipulation action and has applications to both observing and understanding human manipulation actions as well as executing them with a robotic mechanism (e.g. a humanoid robot). It is based on a Combinatory Categorial Grammar. The goal of the introduced framework is to: (1) represent manipulation actions with both syntax and semantic parts, where the semantic part employs -calculus; (2) enable a probabilistic semantic parsing schema to learn the lambda-calculus representation of manipulation action from an annotated action corpus of videos; (3) use (1) and (2) to develop a system that visually observes manipulation actions and understands their meaning while it can reason beyond observations using propositional logic and axiom schemata. The experiments conducted on a public available large manipulation action dataset validate the theoretical framework and our implementation. }} @article{quirk-mooney-15_language-to-code_semantic-parsers-for-if-else--recipes, title={Language to Code: Learning Semantic Parsers for If-This-Then-That Recipes}, author={Quirk, Chris and Mooney, Raymond and Galley, Michel} booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract Using natural language to write programs is a touchstone problem for computational linguistics. We present an approach that learns to map natural-language descriptions of simple “if-then” rules to executable code. By training and testing on a large corpus of naturally-occurring programs (called “recipes”) and their natural language descriptions, we demonstrate the ability to effectively map language to code. We compare a number of semantic parsing approaches on the highly noisy training data collected from ordinary users, and find that loosely synchronous systems perform best. }} @article{zhou-chen-15_learning-bilingual-sentiment-word-embeddings, title={Learning Bilingual Sentiment Word Embeddings for Cross-language Sentiment Classification}, author={Zhou, Huiwei and Chen, Long and Shi, Fulin and Huang, Degen} booktitle={Proceedings of ACL}, year={2015}, annote = { Requires parallel corpus only. --- The sentiment classification performance relies on high-quality sentiment resources. However, these resources are imbalanced in different languages. Cross-language sentiment classification (CLSC) can leverage the rich resources in one language (source language) for sentiment classification in a resource-scarce language (target language). Bilingual embeddings could eliminate the semantic gap between two languages for CLSC, but ignore the sentiment information of text. This paper proposes an approach to learning bilingual sentiment word embeddings (BSWE) for English-Chinese CLSC. The proposed BSWE incorporate sentiment information of text into bilingual embeddings. 
Furthermore, we can learn high-quality BSWE by simply employing labeled corpora and their translations, without relying on large scale parallel corpora. Experiments on the NLP&CC 2013 CLSC dataset show that our approach outperforms the state-of-the-art systems. }}
@article{iyyer-manjunatha-15-deep-unordered-composition-rivals-syntax, title={Deep Unordered Composition Rivals Syntactic Methods for Text Classification}, author={Iyyer, Mohit and Manjunatha, Varun and Boyd-Graber, Jordan and Daum{\'e} III, Hal}, booktitle={Proceedings of ACL}, year={2015}, annote = { ABSTRACT Many existing deep learning models for natural language processing tasks focus on learning the compositionality of their inputs, which requires many expensive computations. We present a simple deep neural network that competes with and, in some cases, outperforms such models on sentiment analysis and factoid question answering tasks while taking only a fraction of the training time. While our model is syntactically-ignorant, we show significant improvements over previous bag-of-words models by deepening our network and applying a novel variant of dropout. Moreover, our model performs better than syntactic models on datasets with high syntactic variance. We show that our model makes similar errors to syntactically-aware models, indicating that for the tasks we consider, nonlinearly transforming the input is more important than tailoring a network to incorporate word order and syntax. }}
@article{elliott-de-15_describing-images-using-inferred-visual-dependency-representations, title={Describing Images using Inferred Visual Dependency Representations}, author={Elliott, Desmond and de Vries, Arjen P}, year={2015}, booktitle={Proc. ACL-15}, annote = { }}
@article{zhou-xu-15_end-to-end-semantic-role-labeling-w-RNN, title={End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks}, author={Zhou, Jie and Xu, Wei}, booktitle={Proceedings of ACL}, year={2015}, annote = { }}
@article{wang-berant-15_building-semantic-parser-overnight, title={Building a Semantic Parser Overnight}, author={Wang, Yushi and Berant, Jonathan and Liang, Percy}, booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract How do we build a semantic parser in a new domain starting with zero training examples? We introduce a new methodology for this setting: First, we use a simple grammar to generate logical forms paired with canonical utterances. The logical forms are meant to cover the desired set of compositional operators, and the canonical utterances are meant to capture the meaning of the logical forms (although clumsily). We then use crowdsourcing to paraphrase these canonical utterances into natural utterances. The resulting data is used to train the semantic parser. We further study the role of compositionality in the resulting paraphrases. Finally, we test our methodology on seven domains and show that we can build an adequate semantic parser in just a few hours. The crowdsourced dataset is available, as is the code. }}
@article{miller-gurevych-15_automatic-disambiguation-of-english-puns, title={Automatic disambiguation of English puns}, author={Miller, Tristan and Gurevych, Iryna}, booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract Traditional approaches to word sense disambiguation (WSD) rest on the assumption that there exists a single, unambiguous communicative intention underlying every word in a document. However, writers sometimes intend for a word to be interpreted as simultaneously carrying multiple distinct meanings.
This deliberate use of lexical ambiguity—i.e., punning— is a particularly common source of humour. In this paper we describe how traditional, language-agnostic WSD approaches can be adapted to “disambiguate” puns, or rather to identify their double meanings. We evaluate several such approaches on a manually sense-annotated collection of English puns and observe performance exceeding that of some knowledge-based and supervised baselines. annotated pun database probably not available yet: Pending clearance of the distribution rights, we will make some or all of our annotated data set available on our website at https://www.ukp.tu-darmstadt.de/data/. }} @article{neelakantan-roth-15_compositional-vectors_knowledge-base-completion, title={Compositional Vector Space Models for Knowledge Base Completion}, author={Neelakantan, Arvind and Roth, Benjamin and McCallum, Andrew}, booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract Knowledge base (KB) completion adds new facts to a KB by making inferences from existing facts, for example by inferring with high likelihood nationality(X,Y) from bornIn(X,Y). Most previous methods infer simple one-hop relational synonyms like this, or use as evidence a multi-hop relational path treated as an atomic feature, like bornIn(X,Z)!containedIn(Z,Y). This paper presents an approach that reasons about conjunctions of multi-hop relations non-atomically, composing the implications of a path using a recurrent neural network (RNN) that takes as inputs vector embeddings of the binary relation in the path. Not only does this allow us to generalize to paths unseen at training time, but also, with a single high-capacity RNN, to predict new relation types not seen when the compositional model was trained (zero-shot learning). We assemble a new dataset of over 52M relational triples, and show that our method improves over a traditional classifier by 11%, and a method leveraging pre-trained embeddings by 7%. {arXiv preprint arXiv:1504.06662}, }} @article{labutov-basu-15-ACL_deep-questions-without-deep-understanding, title={Deep Questions without Deep Understanding}, author={Labutov, Igor and Basu, Sumit and Vanderwende, Lucy} booktitle={Proceedings of ACL}, year={2015}, annote = { develops an approach for generating deep (i.e, high-level) comprehension questions from novel text that bypasses the myriad challenges of creating a full semantic representation. 1. represents the original text in a low-dimensional ontology, 2. crowd-sources candidate question templates aligned with that space 3. ranks potentially relevant templates for a novel region of text If ontological labels are not available, we infer them from the text. We demonstrate the effectiveness of this method on a corpus of articles from Wikipedia alongside human judgments, and find that we can generate relevant deep questions with a precision of over 85% while maintaining a recall of 70%. --- uses section titles (e.g. wikipedia) crowdsourcing: after reading a section, they are asked to "write a follow-up test question on ..." --> creates ~ 1K templates for "deeper", non-factoid questions. ==some of the questions generated== What accomplishments characterized the later ca-reer of Colin Cowdrey? Acting Career How did Corbin Bern-stein’s television career evolve over time? What are some unique ge-ographic features of Puerto Rico Highway 10? How much significance do people of DeMartha Cath-olic High School place on athletics? How does the geography of Puerto Rico Highway 10 impact its resources? 
What type of reaction did Thornton Dial receive? What were the most im-portant events in the later career of Corbin Berstein? What geographic factors influence the preferred transport methods in Wey-mouth Railway Station? How has Freddy Mitch-ell’s legacy shaped current events? }} @article{astudillo-amir-15_learning-word-repr-from-scarce-and-noisy-data, title={Learning Word Representations from Scarce and Noisy Data with Embedding Sub-spaces}, author={Astudillo, Ramon F and Amir, Silvio and Lin, Wang and Silva, M{\'a}rio and Trancoso, Isabel} booktitle={Proceedings of ACL}, year={2015}, annote = { ABSTRACT We investigate a technique to adapt unsupervised word embeddings to specific applications, when only small and noisy labeled datasets are available. Current methods use pre-trained embeddings to initialize model parameters, and then use the labeled data to tailor them for the intended task. However, this approach is prone to overfitting when the training is performed with scarce and noisy data. To overcome this issue, we use the supervised data to find an embedding subspace that fits the task complexity. All the word representations are adapted through a projection into this task-specific subspace, even if they do not occur on the labeled dataset. This approach was recently used in the SemEval 2015 Twitter sentiment analysis challenge, attaining state-of-the-art results. Here we show results improving those of the challenge, as well as additional experiments in a Twitter Part-Of-Speech tagging task. }} @article{persing-ng-V-15_modeling-argument-strength-in-student-essays, title={Modeling Argument Strength in Student Essays}, author={Persing, Isaac and Ng, Vincent} booktitle={Proceedings of ACL}, year={2015}, annote = { Abstract While recent years have seen a surge of interest in automated essay grading, including work on grading essays with respect to particular dimensions such as prompt adherence, coherence, and technical quality, there has been relatively little work on grading the essay dimension of argument strength, which is arguably the most important aspect of argumentative essays. We introduce a new corpus of argumentative student essays annotated with argument strength scores and propose a supervised, feature-rich approach to automatically scoring the essays along this dimension. Our approach significantly outperforms a baseline that relies solely on heuristically applied sentence argument function labels by up to 16.1%. 
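For orientation, here is a toy supervised scoring baseline in scikit-learn (an illustrative sketch with placeholder data, not the paper's feature-rich system, which uses argument-specific features):

# Toy baseline: n-gram features + ridge regression on gold argument-strength scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

essays = ["First sample essay text ...", "Second sample essay text ..."]  # placeholder corpus
scores = [2.5, 3.0]                                                       # gold strength scores
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1), Ridge(alpha=1.0))
model.fit(essays, scores)                       # learn a mapping from n-gram features to scores
print(model.predict(["An unseen essay ..."]))   # predicted argument-strength score

This only illustrates the regression setup; the paper's reported gains come from its much richer feature set.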
}} @article{mitchell-steedman-15_orthogonal-syntax-and-semantics-in-distributional-spaces, title={Orthogonality of Syntax and Semantics within Distributional Spaces}, author={Mitchell, Jeff and Steedman, Mark} booktitle={Proceedings of ACL}, year={2015}, annote = { }} @article{bing-li-P-15-ACL_abstractive-multi-document-summarization_phrase-merging, title={Abstractive Multi-Document Summarization via Phrase Selection and Merging }, author={Bing, Lidong and Li, Piji and Liao, Yi and Lam, Wai and Guo, Weiwei and Passonneau, Rebecca J}, booktitle={Proceedings of ACL}, year={2015}, annote = { arXiv preprint arXiv:1506.01597 }} @article{luong-sutskever-le-15-ACL_rare-word-problem-in-neural-machine-translation, title={Addressing the Rare Word Problem in Neural Machine Translation}, author={Luong, Thang and Sutskever, Ilya and Le, Quoc V and Vinyals, Oriol and Zaremba, Wojciech}, booktitle={Proceedings of ACL}, year={2014}, annote = { 2014 arXiv preprint arXiv:1410.8206 }} @article{ma-J-hinrichs-15-ACL_accurate-linear-time-chinese-word-segm-w-WVs, title={Accurate Linear-Time Chinese Word Segmentation via Embedding Ma2015tching}, author={Ma, Jianqiang and Hinrichs, Erhard} booktitle={Proceedings of ACL}, year={2015}, annote = { }} @article{li-luong-15-ACL_hierarchical-autoencoder-for-paragraphs-and-documents, title={A Hierarchical Neural Autoencoder for Paragraphs and Documents}, author={Li, Jiwei and Luong, Minh-Thang and Jurafsky, Dan}, booktitle={Proceedings of ACL}, year={2015} annote = { Generating multi-sentence (paragraphs/documents) is a challenging problem for recurrent networks models. In this paper, we explore an important step toward this generation task: training an LSTM (Longshort term memory) auto-encoder to preserve and reconstruct multi-sentence paragraphs. }} @article{camacho-pilehvar-15-ACL_unified-multilingual-semantics-of-concepts, title={A unified multilingual semantic representation of concepts}, author={Camacho-Collados, Jos{\'e} and Pilehvar, Mohammad Taher and Navigli, Roberto}, journal={Proceedings of ACL, Beijing, China}, year={2015}, annote = { ABST: most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach in two different evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets. vector space representations are * conventionally co-occurrence based (Salton et al., 1975; Turney and Pantel, 2010; Landauer and Dooley, 2002), * the newer predictive branch (Collobert andWeston, 2008; Mikolov et al., 2013; Baroni et al., 2014), both suffer from a major drawback: they are unable to model individual word senses or concepts, as they conflate different meanings of a word into a single vectorial representation. [hinders WSD) There have been several efforts to adapt and apply distributional approaches to the representation of word senses (Pantel and Lin, 2002; Brody and Lapata, 2009; Reisinger and Mooney, 2010; Huang et al., 2012). 
However, none of these techniques provides representations that are already linked to a standard sense inventory, and consequently such mapping has to be carried out either manually, or with the help of sense-annotated data. Chen et al. (2014) addressed this issue and obtained vectors for individual word senses by leveraging WordNet glosses. NASARI (Camacho-Collados et al., 2015) is another approach that obtains accurate sense-specific representations by combining the complementary knowledge from WordNet and Wikipedia. Graph-based approaches have also been successfully utilized to model individual words (Hughes and Ramage, 2007; Agirre et al., 2009; Yeh et al., 2009), or concepts (Pilehvar et al., 2013; Pilehvar and Navigli, 2014), drawing on the structural properties of semantic networks. The applicability of all these techniques, however, is usually either constrained to a single language (usually English), or to a specific task. }} @inproceedings{levy-goldberg-14-NIPS_neural-word-embedding-as-matrix-factorization, title={Neural word embedding as implicit matrix factorization}, author={Levy, Omer and Goldberg, Yoav}, booktitle={Advances in Neural Information Processing Systems}, pages={2177--2185}, year={2014}, annote = { suggests that what the word2vec model is doing is effectively factorizing a count matrix - the LOW RANK FACTORIZATION gives the word embeddings. shows that the skip-gram model of Mikolov et al, when trained with their negative sampling approach can be understood as a weighted matrix factorization of a word-context matrix with cells weighted by point wise mutual information (PMI), which has long been empirically known to be a useful way of constructing word-context matrices for learning semantic representations of words. }} @article{stratos-collins-15-ACL_word-embeddings-from-count-matrices, title={Model-based Word Embeddings from Decompositions of Count Matrices}, author={Stratos, Karl and Collins, Michael and Hsu, Daniel} booktitle={Proceedings of ACL}, year={2015}, annote = { A successful method for deriving word embeddings is the negative sampling training of the skip-gram model suggested by Mikolov et al. (2013b) and implemented in the popular software WORD2VEC. The form of its training objective was motivated by efficiency considerations, but has subsequently been interpreted by Levy and Goldberg (2014b) as seeking a LOW-RANK FACTORIZATION of a matrix whose entries are word-context co-occurrence counts, scaled and transformed in a certain way. This observation sheds new light on WORD2VEC, yet also raises several new questions about word embeddings based on decomposing count data. What is the right matrix to decompose? Are there rigorous justifications for the choice of matrix and count transformations? We investigate these qs through the decomposition specified by canonical correlation analysis (CCA, Hotelling, 1936), a powerful technique for inducing generic representations whose computation is efficiently and exactly reduced to that of a matrix singular value decomposition (SVD). We build on and strengthen the work of Stratos et al. (2014) which uses CCA for learning the class of HMMs underlying Brown clustering. CCA seeks to maximize the Pearson correlation coefficient between scalar random variables L;R, which is a value in [-11; 1] indicating the degree of linear dependence between L and R. 
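As a concrete illustration of this count-based view (a minimal sketch of my own, not code from either paper): build a PPMI-weighted word-context co-occurrence matrix and take a truncated SVD as the low-rank factorization, using the scaled left singular vectors as embeddings.

import numpy as np

def ppmi_svd_embeddings(corpus, window=2, dim=50):
    # corpus: list of tokenized sentences (lists of strings)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[idx[w], idx[sent[j]]] += 1.0    # word-context co-occurrences
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total        # marginal word probabilities
    p_c = counts.sum(axis=0, keepdims=True) / total        # marginal context probabilities
    pmi = np.log(np.maximum(counts / total, 1e-12) / (p_w * p_c))
    ppmi = np.maximum(pmi, 0.0)                             # positive PMI weighting
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)      # low-rank factorization
    dim = min(dim, len(vocab))
    return U[:, :dim] * S[:dim], vocab                      # rows are word embeddings

Levy and Goldberg's result concerns a shifted version of this PMI matrix (shifted by the log of the number of negative samples); the sketch uses plain PPMI plus SVD for simplicity.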
}} @article{bride-cruys-asher-15-ACL_lexical-composition-in-distributional-semantics, title={A Generalisation of Lexical Functions for Composition in Distributional Semantics}, author={Bride, Antoine and Van de Cruys, Tim and Asher, Nicholas} booktitle={Proceedings of ACL}, year={2015}, annote = { ABSTRACT: It is not straightforward to construct meaning representations beyond the level of individual words – i.e. the combination of words into larger units – using distributional methods. Our contribution is twofold. First of all, we carry out a large-scale evaluation, comparing different composition methods within the distributional framework for the cases of both adjective-noun and noun-noun composition, making use of a newly developed dataset. Secondly, we propose a novel method for composition, which generalises the approach by Baroni and Zamparelli (2010). The performance of our novel method is also evaluated on our new dataset and proves competitive with the best methods. Requires availability of parser --- the additive and multiplicative model (Mitchell and Lapata, 2008) are entirely general, and take as input automatically constructed vectors for adjectives and nouns. The method by Baroni and Zamparelli, on the other hand, requires the acquisition of a particular function for each adjective, represented by a matrix. The second goal of our paper is to generalise the functional approach in order to eliminate the need for an individual function for each adjective. To this goal, we automatically learn a generalised lexical function, based on Baroni and Zamparelli’s approach. This generalised function combines with an adjective vector and a noun vector in a generalised way. The performance of our novel generalised lexical function approach is evaluated on our test sets and proves competitive with the best, extant methods. Our semantic space for French was built using the FRWaC corpus (Baroni et al., 2009) – about 1,6 billion words of web texts – which has been tagged with MElt tagger (Denis et al., 2010) and parsed with MaltParser (Nivre et al., 2006a), trained on a dependency-based version of the French treebank (Candito et al., 2010). Our semantic space for English has been built using the UKWaC corpus (Baroni et al., 2009), which consists of about 2 billion words extracted from the web. The corpus has been part of speech tagged and lemmatized with Stanford Part-Of- Speech Tagger (Toutanova and Manning, 2000; Toutanova et al., 2003), and parsed with Malt- Parser (Nivre et al., 2006b) trained on sections 2-21 of the Wall Street Journal section of the Penn Treebank extended with about 4000 questions from the QuestionBank. [http://maltparser.org/mco/english_parser/engmalt.html] For both corpora, we extracted the lemmas of all nouns, adjectives and (bag of words) context words. We only kept those lemmas that consist of alphabetic characters. 
[filters out dates, numbers and punctuation] }} @inproceedings{purver-sadrzadeh-15_distributional-semantics-to-pragmatics, title={From Distributional Semantics to Distributional Pragmatics?}, author={Purver, Matthew and Sadrzadeh, Mehrnoosh}, booktitle={Interactive Meaning Construction A Workshop at IWCS 2015}, pages={21}, year={2015}, annote = { NO PDF }} ##DEEP LEARNING SYNTAX DISCOVERY @article{kalchbrenner-grefenstette-14_CNN-for-modelling-sentences, title={A convolutional neural network for modelling sentences}, author={Kalchbrenner, Nal and Grefenstette, Edward and Blunsom, Phil}, journal={arXiv preprint arXiv:1404.2188}, year={2014}, annote = { models tree-like hierarchical structures of syntax tested in Hindi by arpit and jayesh. Q. how good is it in detecting phrase boundaries? }} @article{li-chen-15-ACL_visualizing-and-understanding-neural-models-in-NLP, title={Visualizing and Understanding Neural Models in NLP}, author={Li, Jiwei and Chen, Xinlei and Hovy, Eduard and Jurafsky, Dan}, journal={arXiv preprint arXiv:1506.01066}, year={2015}, annote = { }} # MULTIMODAL GROUNDING @inproceedings{kiros-salakhutdinov-14_multimodal-neural-language-models, title={Multimodal neural language models}, author={Kiros, Ryan and Salakhutdinov, Ruslan and Zemel, Rich}, booktitle={Proceedings of the 31st International Conference on Machine Learning (ICML-14)}, pages={595--603}, year={2014}, annote = { Abstract We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees. While we focus on imagetext modelling, our algorithms can be easily applied to other modalities such as audio. Neural Language Models: A neural language model improves on n-gram language models by reducing the curse of dimensionality through the use of distributed word representations. Each word in the vocabulary is represented as a real-valued feature vector such that the cosine of the angles between these vectors is high for semantically similar words. Several models have been proposed based on feedforward networks (Bengio et al., 2003), log-bilinear models (Mnih & Hinton, 2007), skip-gram models (Mikolov et al., 2013) and recurrent neural networks (Mikolov et al., 2010; 2011). Training can be sped up through the use of hierarchical softmax (Morin & Bengio, 2005) or noise contrastive estimation (Mnih & Teh, 2012) Bengio, Yoshua, Ducharme, R´ejean, Vincent, Pascal, and Janvin, Christian. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003. Mikolov, Tomas, Karafi´at, Martin, Burget, Lukas, Cernock`y, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010. Mikolov, Tomas, Deoras, Anoop, Povey, Daniel, Burget, Lukas, and Cernocky, Jan. Strategies for training large scale neural network language models. In ASRU, pp. 196–201. IEEE, 2011. Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. 
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. Mnih, Andriy and Hinton, Geoffrey. Three new graphical models for statistical language modelling. In ICML, pp. 641–648. ACM, 2007. Mnih, Andriy and Teh, Yee Whye. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012. Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In AISTATS, pp. 246–252, 2005. img/kiros-14_multimodal-language_image-description-learning.jpg Figure 1. Left two columns: Sample description retrieval given images. Right two columns: description generation. Each description was initialized to ‘in this picture there is’ or ‘this product contains a’, with 50 subsequent words generated. [I think the top row are input, and bottom is generation; not L/R as in caption] Image Description Generation: A growing body of research has explored how to generate realistic text descriptions given an image. Farhadi et al. (2010) consider learning an intermediate meaning space to project image and sentence features allowing them to retrieve text from images and vice versa. Kulkarni et al. (2011) construct a CRF using unary potentials from objects, attributes and prepositions and high-order potentials from text corpora, using an n-gram model for decoding and templates for constraints. To allow for more descriptive and poetic generation, Mitchell et al. (2012) propose the use of syntactic trees constructed from 700,000 Flickr images and text descriptions. For large scale description generation, Ordonez et al. (2011) showed that non-parametric approaches are effective on a dataset of one million image-text captions. More recently, Socher et al. (2014) show how to map sentence representations from recursive networks into the same space as images. We note that unlike most existing work, our generated text comes directly from language model samples without any additional templates, structure, or constraints. Multimodal Representation Learning: Deep learning methods have been successfully used to learn representations from multiple modalities. Ngiam et al. (2011) proposed using deep autoencoders to learn features from audio and video, while Srivastava & Salakhutdinov (2012) introduced the multimodal deep Boltzmann machine as a joint model of images and text. Unlike Srivastava & Salakhutdinov (2012), our proposed models are conditional and go beyond bag-of-word features. More recently, Socher et al. (2013) and Frome et al. (2013) propose methods for mapping images into a text representation space learned from a language model that incorporates global context (Huang et al., 2012) or a skip-gram model (Mikolov et al., 2013), respectively. This allowed Socher et al. (2013); Frome et al. (2013) to perform zero-shot learning, generalizing to classes the model has never seen before. Similar to our work, the authors combine convolutional networks with a language model but our work instead focuses on text generation and retrieval as opposed to object classification. Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeffrey, and MarcAurelio Ranzato, Tomas Mikolov. Devise: A deep visual-semantic embedding model. 2013. Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In ACL, pp. 873–882, 2012. Socher, Richard, Ganjoo, Milind, Manning, Christopher D, and Ng, Andrew. 
Zero-shot learning through cross-modal transfer. In NIPS, pp. 935–943, 2013. The remainder of the paper is structured as follows. We first review the log-bilinear model of Mnih & Hinton (2007) as it forms the foundation for our work. We then introduce our two proposed models as well as how to perform joint image-text feature learning. Finally, we describe our experiments and results. References Bengio, Yoshua, Ducharme, R´ejean, Vincent, Pascal, and Janvin, Christian. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003. Berg, Tamara L, Berg, Alexander C, and Shih, Jonathan. Automatic attribute discovery and characterization from noisy web data. In ECCV, pp. 663–676. Springer, 2010. Coates, Adam and Ng, Andrew. The importance of encoding versus training with sparse coding and vector quantization. In ICML, pp. 921–928, 2011. Multimodal Neural Language Models Collobert, Ronan and Weston, Jason. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008. Donahue, Jeff; Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy; Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition. Farhadi, Ali, Hejrati, Mohsen, Sadeghi, Mohammad Amin, Young, Peter, Rashtchian, Cyrus, Hockenmaier, Julia, and Forsyth, David. Every picture tells a story: Generating sentences from images. In ECCV, pp. 15–29. Springer, 2010. Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeffrey, and MarcAurelio Ranzato, Tomas Mikolov. Devise: A deep visual-semantic embedding model. 2013. Grangier, David, Monay, Florent, and Bengio, Samy. A discriminative approach for the retrieval of images from text queries. In Machine Learning: ECML 2006, pp. 162–173. Springer, 2006. Grubinger, Michael, Clough, Paul, M¨uller, Henning, and Deselaers, Thomas. The iapr tc-12 benchmark: A new evaluation resource for visual information systems. In InternationalWorkshop OntoImage, pp. 13–23, 2006. Gupta, Ankush and Mannem, Prashanth. From image annotation to image description. In NIP, pp. 196–204. Springer, 2012. Gupta, Ankush, Verma, Yashaswi, and Jawahar, CV. Choosing linguistics over vision to describe images. In AAAI, 2012. Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: data, models and evaluation metrics. JAIR, 47(1):853–899, 2013. Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In ACL, pp. 873–882, 2012. Kiros, Ryan and Szepesv´ari, Csaba. Deep representations and codes for image auto-annotation. In NIPS, pp. 917–925, 2012. Krizhevsky, Alex, Hinton, Geoffrey E, et al. Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, pp. 621–628, 2010. Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoff. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012. Kulkarni, Girish, Premraj, Visruth, Dhar, Sagnik, Li, Siming, Choi, Yejin, Berg, Alexander C, and Berg, Tamara L. Baby talk: Understanding and generating simple image descriptions. In CVPR, pp. 1601–1608. IEEE, 2011. Memisevic, Roland and Hinton, Geoffrey. Unsupervised learning of image transformations. In CVPR, pp. 1–8. IEEE, 2007. Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch, Alyssa, Goyal, Amit, Berg, Alex, Yamaguchi, Kota, Berg, Tamara, Stratos, Karl, and Daum´e III, Hal. 
Midge: Generating image descriptions from computer vision detections. In EACL, pp. 747–756, 2012. Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In AISTATS, pp. 246–252, 2005. Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, and Ng, Andrew. Multimodal deep learning. In ICML, pp. 689–696, 2011. Oliva, Aude and Torralba, Antonio. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001. Ordonez, Vicente, Kulkarni, Girish, and Berg, Tamara L. Im2text: Describing images using 1 million captioned photographs. In NIPS, pp. 1143–1151, 2011. Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei- Jing. Bleu: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. ACL, 2002. Socher, Richard, Ganjoo, Milind, Manning, Christopher D, and Ng, Andrew. Zero-shot learning through cross-modal transfer. In NIPS, pp. 935–943, 2013. Socher, Richard, Le, Quoc V, Manning, Christopher D, and Ng, Andrew Y. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014. Srivastava, Nitish and Salakhutdinov, Ruslan. Multimodal learning with deep boltzmann machines. In NIPS, pp. 2231–2239, 2012. Swersky, Kevin, Snoek, Jasper, and Adams, Ryan. Multi-task bayesian optimization. In NIPS, 2013. Turian, Joseph, Ratinov, Lev, and Bengio, Yoshua. Word representations: a simple and general method for semi-supervised learning. In ACL, pp. 384–394. Association for Computational Linguistics, 2010. Wang, Tao,Wu, David J, Coates, Adam, and Ng, Andrew Y. End-to-end text recognition with convolutional neural networks. In ICPR, pp. 3304–3308. IEEE, 2012. }} @incollection{senapati-garain-14_maximum-entropy-based-honorificity_bengali-anaphora, title={A Maximum Entropy Based Honorificity Identification for Bengali Pronominal Anaphora Resolution}, author={Senapati, Apurbalal and Garain, Utpal}, booktitle={Computational Linguistics and Intelligent Text Processing}, pages={319--329}, year={2014}, publisher={Springer}, annote = { }} @article{tackstrom-ganchev-das-15_efficient-structured-learning-for-semantic-role-labeling, title={Efficient Inference and Structured Learning for Semantic Role Labeling}, author={T{\"a}ckstr{\"o}m, Oscar and Ganchev, Kuzman and Das, Dipanjan}, journal={Transactions of the Association for Computational Linguistics}, volume={3}, pages={29--41}, year={2015}, annote = { ABSTRACT We present a dynamic programming algorithm for efficient constrained inference in semantic role labeling. The algorithm tractably captures a majority of the structural constraints examined by prior work in this area, which has resorted to either approximate methods or off-the-shelf integer linear programming solvers. In addition, it allows training a globally-normalized log-linear model with respect to constrained conditional likelihood. We show that the dynamic program is several times faster than an off-the-shelf integer linear programming solver, while reaching the same solution. Furthermore, we show that our structured model results in significant improvements over its local counterpart, achieving state-of-the-art results on both PropBank- and FrameNet-annotated corpora. 
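To give a flavour of exact decoding by dynamic programming, here is a generic first-order Viterbi sketch of my own (not the paper's constrained SRL algorithm, which additionally enforces role-specific structural constraints):

import numpy as np

def viterbi(emission_scores, transition_scores):
    # emission_scores: (n_tokens, n_labels); transition_scores[prev, cur]: (n_labels, n_labels)
    n, k = emission_scores.shape
    best = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    best[0] = emission_scores[0]
    for t in range(1, n):
        scores = best[t - 1][:, None] + transition_scores + emission_scores[t]
        back[t] = scores.argmax(axis=0)        # best previous label for each current label
        best[t] = scores.max(axis=0)
    path = [int(best[-1].argmax())]
    for t in range(n - 1, 0, -1):              # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]                          # highest-scoring label sequence

The paper's contribution is to keep this kind of exact inference while handling the global structural constraints that previously required integer linear programming.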
}} @article{santos-15_boosting-NER-w-neural-character-embeddings, title={Boosting Named Entity Recognition with Neural Character Embeddings}, author={Santos, Cicero Nogueira dos and Guimar{\~a}es, Victor}, journal={arXiv preprint arXiv:1505.05008}, year={2015}, annote = { uses charWNN ABSTRACT Most state-of-the-art named entity recognition (NER) systems rely on handcrafted features and on the output of other NLP tasks such as part-of-speech (POS) tagging and text chunking. In this work we propose a language-independent NER system that uses automatically learned features only. Our approach is based on the CharWNN deep neural network, which uses word-level and character-level representations (embeddings) to perform sequential classification. We perform an extensive number of experiments using two annotated corpora in two different languages: HAREM I corpus, which contains texts in Portuguese; and the SPA CoNLL- 2002 corpus, which contains texts in Spanish. Our experimental results shade light on the contribution of neural character embeddings for NER. Moreover, we demonstrate that the same neural network which has been successfully applied to POS tagging can also achieve state-of-the-art results for language-independet NER, using the same hyperparameters, and without any handcrafted features. For the HAREM I corpus, CharWNN outperforms the state-of-the-art system by 7.9 points in the F1-score for the total scenario (ten NE classes), and by 7.2 points in the F1 for the selective scenario (five NE classes). Given a sentence, the network gives for each word a score for each class (tag) tau IN T. As depicted in Figure 1, in order to score a word, the network takes as input a fixed-sized window of words centralized in the target word. The input is passed through a sequence of layers where features with increasing levels of complexity are extracted. The output for the whole sentence is then processed using the Viterbi algorithm (Viterbi, 1967) to perform structured prediction. For a detailed description of the CharWNN neural network we refer the reader to (dos Santos and Zadrozny, 2014). 2.1 Word- and Character-level Embeddings As illustrated in Figure 1, the first layer of the network transforms words into real-valued feature vectors (embeddings). These embeddings are meant to capture morphological, syntactic and semantic information about the words. We use a fixed-sized word vocabulary Vwrd, and we consider that words are composed of characters from a fixed-sized character vocabulary V chr. Given a sentence consisting of N words (w1;w2; :::;wN), every word wn is converted into a vector un = [rwrd; rwch], which is composed of two subvectors: the word-level embedding rwrd 2 Rdwrd and the character-level embedding rwch 2 Rclu of wn. While word-level embeddings capture syntactic and semantic information, character-level embeddings capture morphological and shape information. Word-level embeddings are encoded by column vectors in an embedding matrix Wwrd 2 RdwrdjV wrdj, and retrieving the embedding of a particular word consists in a simple matrix-vector multiplication. The matrix Wwrd is a parameter to be learned, and the size of the word-level embedding dwrd is a hyperparameter to be set by the user. Given a word w composed of M characters (c1; c2; :::; cM), we first transform each character cm into a character embedding rchr_m . Character embeddings are encoded by column vectors in the embedding matrix [Wchr IN Rdchr x |Vchr|. 
Given a character c, its embedding rchr is obtained by the matrix-vector product: rchr = Wchrvc, where vc is a vector of size |V_chr| which has value 1 at index c and zero in all other positions. The input for the convolutional layer is the sequence of character embeddings (rchr1 ; rchr2 ; :::; rchrM). The convolutional layer applies a matrix-vector operation to each window of size kchr of successive windows in the sequence (rchr1 ; rchr2 ; :::; rchrM). Let us define the vector zm IN Rdchr kchr as the concatenation of the character embedding m, its (kchr - 1)/2 left neighbors, and its (kchr - 1)/2 right neighbors: }} @article{le-P-zuidema-15_compositional-distributional-semanticsw-LSTM, title={Compositional Distributional Semantics with Long Short Term Memory}, author={Le, Phong and Zuidema, Willem}, journal={arXiv preprint arXiv:1503.02510}, year={2015}, annote = { Abstract We are proposing an extension of the recursive neural network that makes use of a variant of the long short-term memory architecture. The extension allows information low in parse trees to be stored in a memory register (the ‘memory cell’) and used much later higher up in the parse tree. This provides a solution to the vanishing gradient problem and allows the network to capture long range dependencies. Experimental results show that our composition outperformed the traditional neural-network composition on the Stanford Sentiment Treebank. }} @article{shutova-tandon-15_perceptually-grounded-selectional-preferences, title={Perceptually grounded selectional preferences}, author={Shutova, Ekaterina and Tandon, Niket and de Melo, Gerard} booktitle={Proceedings of ACL}, year={2015}, annote = { Selectional preference (SP) : "fly" prefers to select "bird" rather than "dog" "winds" prefers "high" rather than "large" to compute: Take all verb phrases (from a large parse of a corpus) with head "fly" -> find frequency of "bird" as main argument vs that of "dog" ABSTRACT Selectional preference (SPs) are widely used in NLP as a rich source of semantic information. While SPs have been traditionally induced from textual data, human lexical acquisition is known to rely on both linguistic and perceptual experience. We present the first SP learning method that simultaneously draws knowledge from text, images and videos, using image and video descriptions to obtain visual features. Our results show that it outperforms linguistic and visual models in isolation, as well as the existing SP induction approaches. We use the BNC as an approximation of linguistic knowledge and a large collection of tagged images and videos from Flickr (www.flickr.com) as an approximation of perceptual knowledge. The human-annotated labels that accompany media on Flickr enable us to acquire predicate-argument cooccurrence information. predicate similarity: Pad´o et al. (2007) and Erk (2007) used similarity metrics to approximate selectional preference classes. Their underlying hypothesis is that a predicate-argument combination (p, a) is felicitous if the predicate p is frequently observed in the data with the arguments a0 similar to a. The systems compute similarities between distributional representations of arguments in a vector space. 
(requires parser or chunker) }} @inproceedings{andreas-klein-14_much-word-embeddings-encode-syntax, title={How much do word embeddings encode about syntax}, author={Andreas, Jacob and Klein, Dan}, booktitle={Proceedings of ACL-14}, year={2014}, annote = { Abstract Do continuous word embeddings encode any useful information for constituency parsing? We isolate three ways in which word embeddings might augment a state-of-the-art statistical parser: * by connecting out-of-vocabulary words to known ones, * by encouraging common behavior among related in-vocabulary words, and * by directly providing features for the lexicon. We test each of these hypotheses with a targeted change to a state-of-the-art baseline. Despite small gains on extremely small supervised training sets, we find that extra information from embeddings appears to make little or no difference to a parser with adequate training data. Our results support an overall hypothesis that word embeddings import syntactic information that is ultimately redundant with distinctions learned from treebanks in other ways. --- Semi-supervised and unsupervised models for a variety of core NLP tasks, including named-entity recognition (Freitag, 2004), part-of-speech tagging (Sch¨utze, 1995), and chunking (Turian et al., 2010) have been shown to benefit from the inclusion of word embeddings as features. In the other direction, access to a syntactic parse has been shown to be useful for constructing word embeddings for phrases compositionally (Hermann and Blunsom, 2013; Andreas and Ghahramani, 2013). Dependency parsers have seen gains from distributional statistics in the form of discrete word clusters (Koo et al., 2008), and recent work (Bansal et al., 2014) suggests that similar gains can be derived from embeddings like the ones used in this paper. [bansal-gimpel-14_tailoring-word-vectors-for-dependency-parsing] It seems clear that word embeddings exhibit some syntactic structure. Consider Figure 1, which shows embeddings for a variety of English determiners, projected onto their first two principal components. img/andreas-14_wv-clusters-for-determiners We can see that the quantifiers EACH and EVERY cluster together, as do FEW and MOST. These are precisely the kinds of distinctions between determiners that state-splitting in the Berkeley parser has shown to be useful (Petrov and Klein, 2007), and existing work (Mikolov et al., 2013b) has observed that such regular embedding structure extends to many other parts of speech. But we don’t know how prevalent or important such “syntactic axes” are in practice. Thus we have two questions: Are such groupings (learned on large data sets but from less syntactically rich models) better than the ones the parser finds on its own? How much data is needed to learn them without word embeddings? results : for two kinds of embeddings (Col&West and CBOW). For both choices of embedding, the pooling and OOV models provide small gains with very little training data, but no gains on the full training set. our results are consistent with two claims: (1) unsupervised word embeddings do contain some syntactically useful information, but (2) this information is redundant with what the model is able to determine for itself from only a small amount of labeled training data. 
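To eyeball the kind of "syntactic axes" described above, a short sketch (assumes gensim and scikit-learn are installed; the embedding file path and the word list are placeholders, not taken from the paper):

from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

# Load any pretrained word embeddings (path is a placeholder).
vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
determiners = ["each", "every", "few", "most", "the", "a", "this", "that", "some", "all"]
words = [w for w in determiners if w in vectors]
coords = PCA(n_components=2).fit_transform([vectors[w] for w in words])
for w, (x, y) in zip(words, coords):
    print("%-6s %8.3f %8.3f" % (w, x, y))   # check whether quantifiers like EACH/EVERY land close together

Whether such clusters actually help a parser that already has adequate treebank data is exactly the question the paper answers in the negative.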
}} @inproceedings{guo-che-14_revisiting-embedding-for-semi-supervised-learning, title={Revisiting embedding features for simple semi-supervised learning}, author={Guo, Jiang and Che, Wanxiang and Wang, Haifeng and Liu, Ting}, booktitle={Proceedings of EMNLP}, pages={110--120}, year={2014}, annote = { }} @article{ling-dyer-15_two-simple-adaptations-of-word2v-ec-for-syntax, title={Two/Too Simple Adaptations of word2vec for Syntax Problems}, author={Ling, Wang and Chris Dyer and Alan Black and Trancoso, Isabel} journal={NAACL 2015}, year = {2015}, annote = { Proposes "Structured Word2Vec" - more sensitive to word order within the context in both CBOW and skip-Ngram methods, the order of the context words does not influence the prediction output. As such, while these methods may find similar representations for semantically similar words, they are less likely to representations based on the syntactic properties of the words. Structured Word2Vec Abstract We present two simple modifications to Word2Vec... The main issue with the original models is the fact that they are insensitive to word order. While order independence is useful for inducing semantic representations, this leads to suboptimal results when they are used to solve syntax-based problems. We show improvements in part-ofspeech tagging and dependency parsing using our proposed models. }} @inproceedings{bansal-gimpel-14_tailoring-word-vectors-for-dependency-parsing, title={Tailoring continuous word representations for dependency parsing}, author={Bansal, Mohit and Gimpel, Kevin and Livescu, Karen}, booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics}, year={2014}, annote = { Koo et al. (2008) found improvement on in-domain dependency parsing using features based on discrete Brown clusters. In this paper, we experiment with parsing features derived from continuous representations. We find that simple attempts based on discretization of individual word vector dimensions do not improve parsing. We see gains only after first performing a hierarchical clustering of the continuous word vectors and then using features based on the hierarchy. We compare several types of continuous representations, including those made available by other researchers (Turian et al., 2010; Collobert et al., 2011; Huang et al., 2012), and embeddings we have trained using the approach of Mikolov et al. (2013a), which is orders of magnitude faster than the others. The representations exhibit different characteristics, which we demonstrate using both intrinsic metrics and extrinsic parsing evaluation. We report significant improvements over our baseline on both the Penn Treebank (PTB; Marcus et al., 1993) and the English Web treebank (Petrov and McDonald, 2012). While all embeddings yield some parsing improvements, we find larger gains by tailoring them to capture similarity in terms of context within syntactic parses. To this end, we use two simple modifications to the models of Mikolov et al. (2013a): a smaller context window, and conditioning on syntactic context (dependency links and labels). Interestingly, the Brown clusters of Koo et al. (2008) prove to be difficult to beat, but we find that our syntactic tailoring can lead to embeddings that match the parsing performance of Brown (on all test sets) in a fraction of the training time. Finally, a simple parser ensemble on all the representations achieves the best results, suggesting their complementarity for dependency parsing. 
2 ContinuousWord Representations There are many ways to train continuous representations; in this paper, we are primarily interested in neural language models (Bengio et al., 2003), which use neural networks and local context to learn word vectors. Several researchers have made their trained representations publicly available, which we use directly in our experiments. In particular, we use the SENNA embeddings of Collobert et al. (2011); the scaled TURIAN embeddings (C&W) of Turian et al. (2010); and the HUANG global-context, single-prototype embeddings of Huang et al. (2012). We also use the BROWN clusters trained by Koo et al. (2008). Details are given in Table 1. 2.1.1 Smaller ContextWindows We find that context window size w affects the embeddings substantially: with large w, words group with others that are topically-related; with small w, grouped words tend to share the same POS tag. We discuss this further in the intrinsic evaluation presented in x2.2. FN. We created a special vector for unknown words by averaging the vectors for the 50K least frequent words; we did not use this vector for the SKIPDEP (x2.1.2) setting because it performs slightly better without it. 2.1.2 Syntactic Context We expect embeddings to help dependency parsing the most when words that have similar parents and children are close in the embedding space. To target this type of similarity, we train the SKIP model on dependency context instead of the linear context in raw text. When ordinarily training SKIP embeddings, words v0 are drawn from the neighborhood of a target word v, and the sum of logprobabilities of each v0 given v is maximized. We propose to instead choose v0 from the set containing the grandparent, parent, and children words of v in an automatic dependency parse. A simple way to implement this idea is to train the original SKIP model on a corpus of dependency links and labels. For this, we parse the BLLIP corpus (minus PTB) using our baseline dependency parser, then build a corpus in which each line contains a single child word c, its parent word p, its grandparent g, and the dependency label l of thelink... Evaluation: Two measures Word similarity (SIM): One widely-used evaluation compares distances in the continuous space to human judgments of word similarity using the 353-pair dataset of Finkelstein et al. (2002). We compute cosine similarity between the two vectors in each word pair, then order the word pairs by similarity and compute Spearman’s rank correlation coefficient () with the gold similarities. Embeddings with high capture similarity in terms of paraphrase and topical relationships. Clustering-based tagging accuracy (M-1): Intuitively, we expect embeddings to help parsing the most if they can tell us when two words are similar syntactically. To this end, we use a metric based on unsupervised evaluation of POS taggers. We perform clustering and map each cluster to one POS tag so as to maximize tagging accuracy, where multiple clusters can map to the same tag. For clustering, we use k-means with k = 1000 and initialize by placing centroids on the 1000 most-frequent words. Table 2 shows these metrics for representations used in this paper. The BROWN clusters have the highest M-1, indicating high cluster purity in terms of POS tags. The HUANG embeddings have the highest SIM score but low M-1, presumably because they were trained with global context, making them more tuned to capture topical similarity. 
We compare several values for the window size (w) used when training the SKIP embeddings, finding that small w leads to higher M-1 and lower SIM. Table 3 shows examples of clusters obtained by clustering SKIP embeddings of w = 1 versus w = 10, and we see that the former correspond closely to POS tags, while the latter are much more topically-coherent and contain mixed POS tags. For parsing experiments, we choose w = 2 for CBOW and w = 1 for SKIP. Finally, our SKIPDEP embeddings, trained with syntactic context and w = 1 (§2.1.2), achieve the highest M-1 of all continuous representations. In §4, we will relate these intrinsic metrics to extrinsic parsing performance.
Table 3: Example clusters for SKIP embeddings with window size w = 1 (syntactic) and w = 10 (topical).
w = 1: [Mr., Mrs., Ms., Prof., ...], [Jeffrey, Dan, Robert, Peter, ...], [Johnson, Collins, Schmidt, Freedman, ...], [Portugal, Iran, Cuba, Ecuador, ...], [CST, 4:30, 9-10:30, CDT, ...], [his, your, her, its, ...], [truly, wildly, politically, financially, ...]
w = 10: [takeoff, altitude, airport, carry-on, airplane, flown, landings, ...], [health-insurance, clinic, physician, doctor, medical, health-care, ...], [financing, equity, investors, firms, stock, fund, market, ...]
}} @article{lin-ammar-15_unsupervised-pos-induction-with-word-embeddings, title={Unsupervised POS Induction with Word Embeddings}, author={Lin, Chu-Cheng and Ammar, Waleed and Dyer, Chris and Levin, Lori}, journal={arXiv preprint arXiv:1503.06760}, year={2015}, annote = { Abstract Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on “downstream” POS induction results. 1 Introduction we explore the effect of replacing words with their vector space embeddings in two POS induction models: the classic first-order HMM (Kupiec, 1992) and the newly introduced conditional random field autoencoder (Ammar et al., 2014). In each model, instead of using a conditional multinomial distribution to generate a word token w_i ∈ V given a POS tag t_i ∈ T, we use a conditional Gaussian distribution and generate a d-dimensional word embedding v_{w_i} ∈ R^d given t_i. Structured skip-gram embeddings (Ling et al., 2015) extend the standard skip-gram embeddings (Mikolov et al., 2013) by taking into account the relative positions of words in a given context. We use the word2vec tool and Ling et al. (2015)’s modified version to generate both plain and structured skip-gram embeddings in nine languages. [Ling et al.2015] Wang Ling, Chris Dyer, Alan Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.
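(A minimal sketch of the Gaussian-emission idea described above, not the authors' implementation; the array shapes and names are assumptions. The resulting (N, T) matrix of log-densities would replace the multinomial emission scores inside the usual HMM forward-backward / EM recursions.)

  import numpy as np
  from scipy.stats import multivariate_normal

  def gaussian_emission_logprobs(token_embeddings, means, covs):
      # token_embeddings: (N, d) embeddings of the observed word tokens.
      # means: (T, d) per-tag means; covs: (T, d, d) per-tag covariances.
      # Returns log p(v_{w_i} | t) for every token i and tag t.
      N, T = token_embeddings.shape[0], means.shape[0]
      logp = np.empty((N, T))
      for t in range(T):
          logp[:, t] = multivariate_normal.logpdf(token_embeddings, mean=means[t], cov=covs[t])
      return logp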
4.1 Choice of POS Induction Models Here, we compare the following models for POS induction:
* Baseline: HMM with multinomial emissions (Kupiec, 1992),
* Baseline: HMM with log-linear emissions (Berg-Kirkpatrick et al., 2010),
* Baseline: CRF autoencoder with multinomial reconstructions (Ammar et al., 2014),
* Proposed: HMM with Gaussian emissions, and
* Proposed: CRF autoencoder with Gaussian reconstructions.
DATA. To train the POS induction models, we used the plain text from the training sections of the ** CoNLL-X shared task (Buchholz and Marsi, 2006) (for Danish and Turkish), the ** CoNLL 2007 shared task (Nivre et al., 2007) (for Arabic, Basque, Greek, Hungarian and Italian), and the ** Ukwabelana corpus (Spiegler et al., 2010) (for Zulu). For evaluation, we obtain the corresponding gold-standard POS tags by deterministically mapping the language-specific POS tags in the aforementioned corpora to the corresponding universal POS tag set (Petrov et al., 2012). This is the same setup we used in (Ammar et al., 2014). [Petrov et al.2012] Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proc. of LREC, May. }} @inproceedings{deMarneffe-dozat-14_universal-dependencies_cross-linguistic, title={Universal Stanford Dependencies: A cross-linguistic typology}, author={De Marneffe, Marie-Catherine and Dozat, Timothy and Silveira, Natalia and Haverinen, Katri and Ginter, Filip and Nivre, Joakim and Manning, Christopher D}, booktitle={Proceedings of LREC}, pages={4585--4592}, year={2014}, annote = { Revisiting the now de facto standard Stanford dependency representation, we propose an improved taxonomy to capture grammatical relations across languages, including morphologically rich ones. We suggest a two-layered taxonomy: a set of broadly attested universal grammatical relations, to which language-specific relations can be added. We emphasize the lexicalist stance of the Stanford Dependencies, which leads to a particular, partially new treatment of compounding, prepositions, and morphology. We show how existing dependency schemes for several languages map onto the universal taxonomy proposed here and close with consideration of practical implications of dependency representation choices for NLP applications, in particular parsing. We consider how to treat grammatical relations in morphologically rich languages, including achieving an appropriate parallelism between expressing grammatical relations by prepositions versus morphology. We show how existing dependency schemes for other languages which draw from Stanford Dependencies (Chang et al., 2009; Bosco et al., 2013; Haverinen et al., 2013; Seraji et al., 2013; McDonald et al., 2013; Tsarfaty, 2013) can be mapped onto the new taxonomy proposed here. We emphasize the lexicalist stance of both most work in NLP and the syntactic theory on which Stanford Dependencies is based, and hence argue for a particular treatment of compounding and morphology. We also discuss the different forms of dependency representation that should be available for SD to be maximally useful for a wide range of NLP applications, converging on three versions: the basic one, the enhanced one (which adds extra dependencies), and a particular form for parsing. de Marneffe, M.-C. and Manning, C. D. (2008). The Stanford typed dependencies representation. In Workshop on Cross-framework and Cross-domain Parser Evaluation.
}} @article{fonseca-rosa-15_evaluating-word-embeddings_POS-tagging-in-portuguese, title={Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese}, author={Fonseca, Erick R and Rosa, Jo{\~a}o Lu{\'\i}s G and Alu{\'\i}sio, Sandra Maria}, journal={Journal of the Brazilian Computer Society}, volume={21}, number={1}, pages={1--14}, year={2015}, publisher={Springer}, annote = { no PDF, only online: http://link.springer.com/article/10.1186/s13173-014-0020-x/fulltext.html Based on a charWNN-like model: 7. Maia MRdH, Xexéo GB (2011) Part-of-speech tagging of Portuguese using hidden Markov models with character language model emissions. In: Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology, Cuiabá, Brazil. pp 159–163 Our new version of Mac-Morpho is available at http://nilc.icmc.usp.br/macmorpho/, along with previous versions of the corpus. Our tagger is at http://nilc.icmc.sc.usp.br/nlpnet/. }} @inproceedings{gouws-sogaard-15-naacl_task-specific-bilingual-word-embeddings, title={Simple task-specific bilingual word embeddings}, author={Gouws, Stephan and S{\o}gaard, Anders}, booktitle={Proceedings of NAACL-HLT}, pages={1386--1390}, year={2015}, annote = { }} @inproceedings{santos-zadrozny-14-icml_character-level-repre-for-POS-tagging-portuguese, title={Learning character-level representations for part-of-speech tagging}, author={Santos, Cicero D and Zadrozny, Bianca}, booktitle={Proceedings of the 31st International Conference on Machine Learning (ICML-14)}, pages={1818--1826}, year={2014}, annote = { For morphologically rich languages: instead of doing the analysis at the level of words, use the characters in the words directly. That way the system can find similarities between morphologically inflected words. charWNN: generates character-level embeddings for each word, even for words outside the vocabulary. The similarities for characters seem more morphologically biased (see Tables below) ... information about word morphology and shape is crucial for various NLP tasks. Usually, when a task needs morphological or word shape information, the knowledge is included as handcrafted features (Collobert, 2011). One task for which character-level information is very important is part-of-speech (POS) tagging, which consists in labeling each word in a text with a unique POS tag, e.g. noun, verb, pronoun and preposition. In this paper we propose a new deep neural network (DNN) architecture that joins word-level and character-level representations to perform POS tagging. The proposed DNN, which we call CharWNN, uses a convolutional layer that allows effective feature extraction from words of any size. At tagging time, the convolutional layer generates character-level embeddings for each word, even for the ones that are outside the vocabulary. We present experimental results that demonstrate the effectiveness of our approach to extract character-level features relevant to POS tagging. Using CharWNN, we create state-of-the-art POS taggers for two languages: English and Portuguese. Both POS taggers are created from scratch, without including any handcrafted feature. Additionally, we demonstrate that a similar DNN that does not use character-level embeddings only achieves state-of-the-art results when using handcrafted features. Although here we focus exclusively on POS tagging, the proposed architecture can be used, without modifications, for other NLP tasks such as text chunking and named entity recognition. 2.1.
Initial Representation Levels The first layer of the network transforms words into real-valued feature vectors (embeddings) that capture morphological, syntactic and semantic information about the words. We use a fixed-sized word vocabulary V_wrd, and we consider that words are composed of characters from a fixed-sized character vocabulary V_chr. Given a sentence consisting of N words {w_1, w_2, ..., w_N}, every word w_n is converted into a vector u_n = [r_wrd; r_wch], which is composed of two sub-vectors: the word-level embedding r_wrd ∈ R^{d_wrd} and the character-level embedding r_wch ∈ R^{clu} of w_n. While word-level embeddings are meant to capture syntactic and semantic information, character-level embeddings capture morphological and shape information. 2.1.2. CHARACTER-LEVEL EMBEDDINGS Robust methods to extract morphological information from words must take into consideration all characters of the word and select which features are more important for the task at hand. For instance, in the POS tagging task, informative features may appear in the beginning (like the prefix “un” in “unfortunate”), in the middle (like the hyphen in “self-sufficient” and the “h” in “10h30”), or at the end (like the suffix “ly” in “constantly”). Character convolution - first introduced by Waibel et al. (1989). As depicted in Fig. 1, our convolutional approach produces local features around each character of the word and then combines them using a max operation to create a fixed-sized character-level embedding of the word. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. J. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3):328–339, 1989. Like in (Collobert et al., 2011), we use a prediction scheme that takes into account the sentence structure. The method uses a transition score A_{t,u} for jumping from tag t ∈ T to u ∈ T in successive words, and a score A_{0,t} for starting from the t-th tag. Given the sentence w_1, w_2, ..., w_N, the score for tag path t_1, ..., t_N is computed as S = Σ_{n=1..N} (A_{t_{n-1}, t_n} + s(w_n)_{t_n}), where s(w_n)_{t_n} is the network score for assigning tag t_n to the n-th word. The input for the convolutional layer is the sequence of character embeddings. The convolutional layer applies a matrix-vector operation to each window of size k_chr of successive character embeddings in the sequence. 2.3. Network Training Our network is trained by minimizing a negative likelihood over the training set D. In the same way as in (Collobert et al., 2011), we interpret the sentence score (5) as a conditional probability over a path. For this purpose, we exponentiate the score (5) and normalize it with respect to all possible paths. Taking the log, we arrive at the following conditional log-probability... For the Portuguese experiments, we use three sources of unlabeled text: the Portuguese Wikipedia; the CETENFolha corpus; and the CETEMPublico corpus. The Portuguese Wikipedia corpus is preprocessed using the same approach applied to the English Wikipedia corpus. However, we implemented our own Portuguese tokenizer. ---
Table 1. WSJ Corpus for English POS Tagging.
SET       SENT.   TOKENS   OOSV   OOUV
TRAINING  38,219  912,344  0      6317
DEVELOP.  5,527   131,768  4,467  958
TEST      5,462   129,654  3,649  923
We use the Mac-Morpho corpus to perform POS tagging of Portuguese language text. The Mac-Morpho corpus (Aluísio et al., 2003) contains around 1.2 million manually tagged words. Its tagset contains 22 POS tags (41 if punctuation marks are included) and 10 more tags that represent additional semantic aspects.
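(A minimal numpy sketch of the character-level convolution plus max-over-time pooling described above; not the authors' Theano code, and the shapes/names are assumptions.)

  import numpy as np

  def char_level_embedding(char_vecs, W, b, k):
      # char_vecs: (L, d_chr) embeddings of one word's characters, padded so L >= k.
      # W: (clu, k * d_chr) and b: (clu,) are the convolution parameters.
      # Each window of k consecutive character vectors is concatenated and
      # projected; a max over all window positions yields a fixed-size
      # character-level embedding r_wch of the word.
      L, d = char_vecs.shape
      windows = np.stack([char_vecs[i:i + k].reshape(-1) for i in range(L - k + 1)])
      return (windows @ W.T + b).max(axis=0)   # shape: (clu,)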
Our Theano-based implementation of CharWNN takes around 2h30min to complete eight training epochs for the Mac-Morpho corpus. In our experiments, we use 4 threads on an Intel Xeon E5-2643 3.30GHz machine.
Table 10. Most similar words using character-level embeddings learned with WSJ Corpus.
INCONSIDERABLE 83-YEAR-OLD SHEEP-LIKE DOMESTICALLY UNSTEADINESS 0.0055
------------------------------------------------------------------------------
INCONCEIVABLE 43-YEAR-OLD ROCKET-LIKE FINANCIALLY UNEASINESS 0.0085
INDISTINGUISHABLE 63-YEAR-OLD FERN-LIKE ESSENTIALLY UNHAPPINESS 0.0075
INNUMERABLE 73-YEAR-OLD SLIVER-LIKE GENERALLY UNPLEASANTNESS 0.0015
INCOMPATIBLE 49-YEAR-OLD BUSINESS-LIKE IRONICALLY BUSINESS 0.0040
INCOMPREHENSIBLE 53-YEAR-OLD WAR-LIKE SPECIALLY UNWILLINGNESS 0.025
Table 11. Most similar words using word-level embeddings learned using unlabeled English texts.
INCONSIDERABLE 00-YEAR-OLD SHEEP-LIKE DOMESTICALLY UNSTEADINESS 0.0000
------------------------------------------------------------------------------
INSIGNIFICANT SEVENTEEN-YEAR-OLD BURROWER WORLDWIDE PARESTHESIA 0.00000
INORDINATE SIXTEEN-YEAR-OLD CRUSTACEAN-LIKE 000,000,000 HYPERSALIVATION 0.000
ASSUREDLY FOURTEEN-YEAR-OLD TROLL-LIKE 00,000,000 DROWSINESS 0.000000
UNDESERVED NINETEEN-YEAR-OLD SCORPION-LIKE SALES DIPLOPIA +/-
SCRUPLE FIFTEEN-YEAR-OLD UROHIDROSIS RETAILS BREATHLESSNESS -0.00
---
Table 6. Most similar words using character-level embeddings learned with Mac-Morpho Corpus.
GRADACOES CLANDESTINAMENTE REVOGACAO DESLUMBRAMENTO DROGASSE
------------------------------------------------------------------------------
TRADICOES GLOBALMENTE RENOVACAO DESPOJAMENTO DIVULGASSE
TRADUCOES CRONOMETRICAMENTE REAVALIACAO DESMANTELAMENTO DIRIGISSE
ADAPTACOES CONDICIONALMENTE REVITALIZACAO DESREGRAMENTO JOGASSE
INTRADUCOES APARENTEMENTE REFIGURACAO DESENRAIZAMENTO PROCURASSE
REDACOES FINANCEIRAMENTE RECOLOCACAO DESMATAMENTO JULGASSE
------------------------------------------------------------------------------
At word level: similar to GRADACOES -> TONALIDADES MODULACOES CARACTERIZACOES NUANCAS COLORACOES; CLANDESTINAMENTE --> ILEGALMENTE ALI ATAMBUA BRAZZAVILLE VOLUNTARIAMENTE
Abstract Distributed word representations have recently been proven to be an invaluable resource for NLP. These representations are normally learned using neural networks and capture syntactic and semantic information about words. Information about word morphology and shape is normally ignored when learning word representations. However, for tasks like part-of-speech tagging, intra-word information is extremely useful, especially when dealing with morphologically rich languages. *** In this paper, we propose a deep neural network that learns character-level representations of words and associates them with usual word representations to perform POS tagging. Using the proposed approach, while avoiding the use of any handcrafted feature, we produce state-of-the-art POS taggers for two languages: English, with 97.32% accuracy on the Penn Treebank WSJ corpus; and Portuguese, with 97.47% accuracy on the Mac-Morpho corpus, where the latter represents an error reduction of 12.2% on the best previous known result.
}} %%%%%%%% @inproceedings{socher-bauer-ng-manning-13_parsing-with-vector-grammars, title={Parsing with compositional vector grammars}, author={Socher, Richard and Bauer, John and Manning, Christopher D and Ng, Andrew Y}, booktitle={Proceedings of the ACL Conference}, year={2013}, annote = { Abstract Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and, implemented approximately as an efficient reranker, it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments. }} @inproceedings{attardi-15_deepNL-deep-learning-nlp-pipeline, title={DeepNL: a Deep Learning NLP pipeline}, author={Attardi, Giuseppe}, booktitle={Proceedings of NAACL-HLT}, pages={109--115}, year={2015}, annote = { Abstract We present the architecture of a deep learning pipeline for natural language processing. Based on this architecture we built a set of tools both for creating distributional vector representations and for performing specific NLP tasks. Three methods are available for creating embeddings: feedforward neural network, sentiment specific embeddings and embeddings based on counts and Hellinger PCA. Two methods are provided for training a network to perform sequence tagging, a window approach and a convolutional approach. The window approach is used for implementing a POS tagger and a NER tagger, the convolutional network is used for Semantic Role Labeling. The library is implemented in Python with core numerical processing written in C++ using a parallel linear algebra library for efficiency and scalability. }} @article{tammewar-singla-15_leveraging-lexical-information-in-SMT, title={Methods for Leveraging Lexical Information in SMT}, author={Aniruddha Tammewar and Karan Singla and Bhasha Agrawal and Riyaz Bhat and Dipti Misra Sharma}, year={2015}, journal = {Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2015)}, pages = {21--30}, annote = { Abstract Word embeddings have been shown to be useful in a wide range of NLP tasks. We explore methods of using the embeddings in dependency parsing of Hindi, a MoR-FWO (morphologically rich, relatively freer word order) language, and show that they not only help improve the quality of parsing, but can even act as a cheap alternative to the traditional features which are costly to acquire. We demonstrate that if we use distributed representations of lexical items instead of features produced by costly tools such as a morphological analyzer, we get competitive results. This implies that a monolingual corpus alone will suffice to produce good accuracy for resource-poor languages for which these tools are unavailable. We also explored the importance of these representations for domain adaptation.
--- We used MaltParser (version 1.8) (Nivre et al., 2007) for training the parser and Word2Vec for vector space modeling. Chen and Manning (2014) have recently proposed a way of learning a neural network classifier for use in a greedy, transition-based dependency parser. They used a small number of dense features, unlike most of the current parsers which use millions of sparse indicator features. The small number of features makes it work very fast. The use of low-dimensional dense word embeddings as features in place of sparse features has recently been adopted in many NLP tasks and has successfully shown improvements in terms of both time and accuracy. Recent works include POS tagging (Collobert et al., 2011), Machine Translation (Devlin et al., 2014) and Constituency Parsing (Socher et al., 2013). These dense, continuous word embeddings allow words to be used more effectively in statistical approaches. The problem of data sparsity is reduced, and similarity between words can now be calculated using vectors. }} @article{bisk-hockenmaier-15_probing-unsupervised-grammar-induction, title={Probing the Linguistic Strengths and Limitations of Unsupervised Grammar Induction}, author={Bisk, Yonatan and Hockenmaier, Julia}, journal={Proceedings of ACL}, year={2015}, annote = { Abstract Work in grammar induction should help shed light on the amount of syntactic structure that is discoverable from raw word or tag sequences. But since most current grammar induction algorithms produce unlabeled dependencies, it is difficult to analyze what types of constructions these algorithms can or cannot capture, and, therefore, to identify where additional supervision may be necessary. This paper provides an in-depth analysis of the errors made by unsupervised CCG parsers by evaluating them against the labeled dependencies in CCGbank, hinting at new research directions necessary for progress in grammar induction. }} @inproceedings{dong-wei-15_question-answering-over-freebase-multi-column-CNN, title={Question Answering over Freebase with Multi-Column Convolutional Neural Networks}, author={Dong, Li and Wei, Furu and Zhou, Ming and Xu, Ke}, booktitle={Proceedings of ACL}, year={2015}, annote = { ... we introduce the multi-column convolutional neural networks (MCCNNs) to automatically analyze questions from multiple aspects. Specifically, the model shares the same word embeddings to represent question words. MCCNNs use different column networks to extract answer types, relations, and context information from the input questions. The entities and relations in the knowledge base (namely FREEBASE in our experiments) are also represented as low-dimensional vectors. Then, a score layer is employed to rank candidate answers according to the representations of questions and candidate answers. The proposed information extraction based method utilizes question-answer pairs to automatically learn the model without relying on manually annotated logical forms and hand-crafted features. We also do not use any pre-defined lexical triggers and rules. In addition, question paraphrases are also used to train the networks and generalize for unseen words in a multi-task learning manner. We have conducted extensive experiments on WEBQUESTIONS. Experimental results illustrate that our method outperforms several baseline systems. ... The models share the same word embeddings, and have multiple columns of convolutional neural networks. The number of columns is set to three in our QA task.
These columns are used to analyze different aspects of a question, i.e., answer path, answer context, and answer type. The vector representations learned by these columns are denoted as f1(q), f2(q), f3(q). We also learn embeddings for the candidate answers that appear in FREEBASE. Abstract Answering natural language questions over a knowledge base is an important and challenging task. Most existing systems typically rely on hand-crafted features and rules to conduct question understanding and/or answer ranking. In this paper, we introduce multi-column convolutional neural networks (MCCNNs) to understand questions from three different aspects (namely, answer path, answer context, and answer type) and learn their distributed representations. Meanwhile, we jointly learn low-dimensional embeddings of entities and relations in the knowledge base. Question-answer pairs are used to train the model to rank candidate answers. We also leverage question paraphrases to train the column networks in a multi-task learning manner. We use Freebase as the knowledge base and conduct extensive experiments on the WebQuestions dataset. Experimental results show that our method achieves better or comparable performance compared with baseline systems. In addition, we develop a method to compute the salience scores of question words in different column networks. The results help us intuitively understand what MCCNNs learn. }} @article{liu-jiang-15_learning-semantic-word-embeddings-from-ordinal-knowledge, title={Learning semantic word embeddings based on ordinal knowledge constraints}, author={Liu, Quan and Jiang, Hui and Wei, Si and Ling, Zhen-Hua and Hu, Yu}, journal={Proceedings of ACL, Beijing, China}, year={2015}, annote = { We represent semantic knowledge as many ordinal ranking inequalities and formulate the learning of semantic word embeddings (SWE) as a constrained optimization problem, where the data-derived objective function is optimized subject to all ordinal knowledge inequality constraints extracted from available knowledge resources such as thesauri and WordNet. We have demonstrated that this constrained optimization problem can be efficiently solved by the stochastic gradient descent (SGD) algorithm, even for a large number of inequality constraints. Experimental results on four standard NLP tasks, including word similarity measure, sentence completion, named entity recognition, and TOEFL synonym selection, have all demonstrated that the quality of learned word vectors can be significantly improved after semantic knowledge is incorporated as inequality constraints during the learning process of word embeddings. }} @inproceedings{chen-lin-15_revisiting-word-embedding_contrasting-meaning, title={Revisiting word embedding for contrasting meaning}, author={Chen, Zhigang and Lin, Wei and Chen, Qian and Chen, Xiaoping and Wei, Si and Zhu, Xiaodan and Jiang, Hui}, booktitle={Proceedings of ACL}, year={2015}, annote = { }} @inproceedings{guo-wang-15_semantically-smooth-knowledge-graph-embedding, title={Semantically Smooth Knowledge Graph Embedding}, author={Guo, Shu and Wang, Quan and Wang, Bin and Wang, Lihong and Guo, Li}, booktitle={Proceedings of ACL}, year={2015}, annote = { This paper considers the problem of embedding Knowledge Graphs (KGs) consisting of entities and relations into low-dimensional vector spaces. Most of the existing methods perform this task based solely on observed facts.
The only requirement is that the learned embeddings should be compatible within each individual fact. In this paper, aiming at further discovering the intrinsic geometric structure of the embedding space, we propose Semantically Smooth Embedding (SSE). The key idea of SSE is to take full advantage of additional semantic information and enforce the embedding space to be semantically smooth, i.e., entities belonging to the same semantic category will lie close to each other in the embedding space. Two manifold learning algorithms, Laplacian Eigenmaps and Locally Linear Embedding, are used to model the smoothness assumption. Both are formulated as geometrically based regularization terms to constrain the embedding task. We empirically evaluate SSE in two benchmark tasks of link prediction and triple classification, and achieve significant and consistent improvements over state-of-the-art methods. Furthermore, SSE is a general framework. The smoothness assumption can be imposed on a wide variety of embedding models, and it can also be constructed using other information besides entities' semantic categories. }} @inproceedings{iacobacci-pilehvar-15_sensembed-embeddings-for-word-relations, title={SensEmbed: Learning sense embeddings for word and relational similarity}, author={Iacobacci, Ignacio and Pilehvar, Mohammad Taher and Navigli, Roberto}, booktitle={Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China}, year={2015}, annote = { Word embeddings have recently gained considerable popularity for modeling words in different Natural Language Processing (NLP) tasks including semantic similarity measurement. However, notwithstanding their success, word embeddings are by their very nature unable to capture polysemy, as different meanings of a word are conflated into a single representation. In addition, their learning process usually relies on massive corpora only, preventing them from taking advantage of structured knowledge. We address both issues by proposing a multi-faceted approach that transforms word embeddings to the sense level and leverages knowledge from a large semantic network for effective semantic similarity measurement. We evaluate our approach on word similarity and relational similarity frameworks, reporting state-of-the-art performance on multiple datasets. }} @article{wang-15_sentiment-aspect-w-restricted-boltzmann-machines, title={Sentiment-Aspect Extraction based on Restricted Boltzmann Machines}, author={Wang, Linlin and Liu, Kang and Cao, Zhu and Zhao, Jun and de Melo, Gerard}, journal={Proceedings of ACL}, year={2015}, annote = { Regarding aspect identification, previous methods can be divided into three main categories: rule-based, supervised, and topic model-based methods. For instance, association rule-based methods (Hu and Liu, 2004; Liu et al., 1998) tend to focus on extracting product feature words and opinion words but neglect connecting product features at the aspect level. Existing rule-based methods typically are not able to group the extracted aspect terms into categories. Supervised (Jin et al., 2009; Choi and Cardie, 2010) and semi-supervised learning methods (Zagibalov and Carroll, 2008; Mukherjee and Liu, 2012) were introduced to resolve certain aspect identification problems. However, supervised training requires hand-labeled training data and has trouble coping with domain adaptation scenarios. Hence, unsupervised methods are often adopted to avoid this sort of dependency on labeled data.
Latent Dirichlet Allocation, or LDA for short (Blei et al., 2003), performs well in automatically extracting aspects and grouping corresponding representative words into categories. Thus, a number of LDA-based aspect identification approaches have been proposed in recent years (Brody and Elhadad, 2010; Titov and McDonald, 2008; Zhao et al., 2010). Still, these methods have several important drawbacks. First, inaccurate approximations of the distribution over topics may reduce the computational accuracy. Second, mixture models are unable to exploit the co-occurrence of topics to yield high probability predictions for words that are sharper than the distributions predicted by individual topics (Hinton and Salakhutdinov, 2009). To overcome the weaknesses of existing methods and pursue the promising direction of jointly learning aspect and sentiment, we present the novel Sentiment-Aspect Extraction RBM (SERBM) model to simultaneously extract aspects of entities and relevant sentiment-bearing words. This two-layer structure model is inspired by conventional Restricted Boltzmann machines (RBMs). In previous work, RBMs with shared parameters (RSMs) have achieved great success in capturing distributed semantic representations from text (Hinton and Salakhutdinov, 2009). Abstract Aspect extraction and sentiment analysis of reviews are both important tasks in opinion mining. We propose a novel sentiment and aspect extraction model based on Restricted Boltzmann Machines to jointly address these two tasks in an unsupervised setting. This model reflects the generation process of reviews by introducing a heterogeneous structure into the hidden layer and incorporating informative priors. Experiments show that our model outperforms previous state-of-the-art methods. }} @inproceedings{kennington-schlangen-15_compositional_perceptually-grounded-word-learning, title={Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution}, author={Kennington, Casey and Schlangen, David}, booktitle={Proceedings of the Conference for the Association for Computational Linguistics (ACL)}, year={2015}, annote = { }} @article{peng-chang-roth-15-CONLL_joint-coreference-and-mention-head-detection, title={A Joint Framework for Coreference Resolution and Mention Head Detection}, author={Peng, Haoruo and Chang, Kai-Wei and Roth, Dan}, journal={CoNLL 2015}, volume={51}, pages={12}, year={2015}, annote = { "mention heads": “the incumbent [Barack Obama]” --> “Barack Obama”; “[officials] at the Pentagon” --> “officials”. Mention heads can be used as auxiliary structures for coreference. .... The problem of identifying mention heads is a sequential phrase identification problem, and we choose to employ the BILOU representation as it has advantages over the traditional BIO representation, as shown, e.g., in Ratinov and Roth (2009). The BILOU representation suggests learning classifiers that identify the Beginning, Inside and Last tokens of multi-token chunks as well as Unit-length chunks. The problem is then transformed into a simple, but constrained, 5-class classification problem.
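(A minimal sketch of the BILOU encoding just described; the helper and its span format are hypothetical, not from the paper.)

  def to_bilou(tokens, chunks):
      # chunks: list of (start, end) token spans, end exclusive.
      # Unit-length chunks get U; longer chunks get B ... I ... L; all else O.
      tags = ["O"] * len(tokens)
      for start, end in chunks:
          if end - start == 1:
              tags[start] = "U"
          else:
              tags[start] = "B"
              for i in range(start + 1, end - 1):
                  tags[i] = "I"
              tags[end - 1] = "L"
      return tags

  # e.g. to_bilou("the incumbent Barack Obama".split(), [(2, 4)]) -> ['O', 'O', 'B', 'L']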
}} @article{ishiwatari-kaji-15-CONLL_accurate-cross-lingual-projection_word-vectors, title={Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs}, author={Ishiwatari, Shonosuke and Kaji, Nobuhiro and Yoshinaga, Naoki and Toyoda, Masashi and Kitsuregawa, Masaru}, journal={CoNLL 2015}, pages={300}, year={2015}, annote = { **** Abstract We propose a method that learns a crosslingual projection of word representations from one language into another. Our method utilizes translatable context pairs as bonus terms of the objective function. In the experiments, our method outperformed existing methods in three language pairs, (English, Spanish), (Japanese, Chinese) and (English, Japanese), without using any additional supervisions. }} @article{qu-L-ferraro-baldwin-15-CONLL_big-data-domain-word-repr_sequence-labelling, title={Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representation on Sequence Labelling Tasks}, author={Qu, Lizhen and Ferraro, Gabriela and Zhou, Liyuan and Hou, Weiwei and Schneider, Nathan and Baldwin, Timothy}, journal={CoNLL 2015}, year={2015}, annote = { Abstract Word embeddings — distributed word representations that can be learned from unlabelled data — have been shown to have high utility in many natural language processing applications. In this paper, we perform an extrinsic evaluation of four popular word embedding methods in the context of four sequence labelling tasks: part-of-speech tagging, syntactic chunking, named entity recognition, and multiword expression identification. A particular focus of the paper is analysing the effects of task-based updating of word representations. We show that when using word embeddings as features, as few as several hundred training instances are sufficient to achieve competitive results, and that word embeddings lead to improvements over out-of-vocabulary words and also out of domain. Perhaps more surprisingly, our results indicate there is little difference between the different word embedding methods, and that simple Brown clusters are often competitive with word embeddings across all tasks we consider. Possible project: POS-Tagging for Hindi w word vector models }} @inproceedings{johannsen-hovy-15-CONLL_cross-lingual-syntactic-variation_age-gender, title={Cross-lingual syntactic variation over age and gender}, author={Johannsen, Anders and Hovy, Dirk and S{\o}gaard, Anders}, booktitle={Proceedings of CoNLL}, year={2015}, annote = { Abstract Most computational sociolinguistics studies have focused on phonological and lexical variation. We present the first large-scale study of syntactic variation among demographic groups (age and gender) across several languages. We harvest data from online user-review sites and parse it with universal dependencies. We show that several age and gender-specific variations hold across languages, for example that women are more likely to use VP conjunctions. }} @article{duong-cohn-15-CONLL_cross-lingual-transfer-unsupervised-dependency-parsing, title={Cross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data}, author={Duong, Long and Cohn, Trevor and Bird, Steven and Cook, Paul}, journal={CoNLL 2015}, pages={113}, year={2015}, annote = { Abstract Cross-lingual transfer has been shown to produce good results for dependency parsing of resource-poor languages. Although this avoids the need for a target language treebank, most approaches have still used large parallel corpora. 
However, parallel data is scarce for low-resource languages, and we report a new method that does not need parallel data. Our method learns syntactic word embeddings that generalise over the syntactic contexts of a bilingual vocabulary, and incorporates these into a neural network parser. We show empirical improvements over a baseline delexicalised parser on both the CoNLL and Universal Dependency Treebank datasets. We analyse the importance of the source languages, and show that combining multiple source-languages leads to a substantial improvement. }} @article{luong-kayser-15-CONLL_deep-neural-language-models-for-translation, title={Deep Neural Language Models for Machine Translation}, author={Luong, Minh-Thang and Kayser, Michael and Manning, Christopher D}, journal={CoNLL 2015}, pages={305}, year={2015}, annote = { Abstract Neural language models (NLMs) have been able to improve machine translation (MT) thanks to their ability to generalize well to long contexts. Despite recent successes of deep neural networks in speech and vision, the general practice in MT is to incorporate NLMs with only one or two hidden layers and there have not been clear results on whether having more layers helps. In this paper, we demonstrate that deep NLMs with three or four layers outperform those with fewer layers in terms of both the perplexity and the translation quality. We combine various techniques to successfully train deep NLMs that jointly condition on both the source and target contexts. When reranking n-best lists of a strong web-forum baseline, our deep models yield an average boost of 0.5 TER / 0.5 BLEU points compared to using a shallow NLM. Additionally, we adapt our models to a new SMS-chat domain and obtain a similar gain of 1.0 TER / 0.5 BLEU points. }} @article{bogdanova-dos-15-CONLL_detecting-semantically-equivalent-questions, title={Detecting Semantically Equivalent Questions in Online User Forums}, author={Bogdanova, Dasha and dos Santos, C{\'\i}cero and Barbosa, Luciano and Zadrozny, Bianca}, journal={CoNLL 2015}, pages={123}, year={2015}, annote = { Abstract Two questions asking the same thing could be too different in terms of vocabulary and syntactic structure, which makes identifying their semantic equivalence challenging. This study aims to detect semantically equivalent questions in online user forums. We perform an extensive number of experiments using data from two different Stack Exchange forums. We compare standard machine learning methods such as Support Vector Machines (SVM) with a convolutional neural network (CNN). The proposed CNN generates distributed vector representations for pairs of questions and scores them using a similarity metric. We evaluate in-domain word embeddings versus the ones trained with Wikipedia, estimate the impact of the training set size, and evaluate some aspects of domain adaptation. Our experimental results show that the convolutional neural network with in-domain word embeddings achieves high performance even with limited training data. }} @article{mihaylov-georgiev-15-CONLL_finding-opinion-manipulation-trolls-in-forums, title={Finding Opinion Manipulation Trolls in News Community Forums}, author={Mihaylov, Todor and Georgiev, Georgi D and Nakov, Preslav}, journal={CoNLL 2015}, pages={310}, year={2015}, annote = { Trolls: Humans inserted into a social forum to create discord... We considered as trolls users who were called such by at least n distinct users, and non-trolls if they have never been called so.
Requiring that a user should have at least 100 comments in order to be interesting for our experiments left us with 317 trolls and 964 non-trolls. Here are two examples (translated): “To comment from ”Rozalina”: You, trolls, are so funny :) I saw the same signature under other comments:)” “To comment from ”Historama”: Murzi, you know that you cannot manipulate public opinion, right?” [FN: Murzi is short for murzilka. According to a series of recent publications in Bulgarian media, Russian Internet users reportedly use the term murzilka to refer to Internet trolls.] Abstract The emergence of user forums in electronic news media has given rise to the proliferation of opinion manipulation trolls. Finding such trolls automatically is a hard task, as there is no easy way to recognize or even to define what they are; this also makes it hard to get training and testing data. We solve this issue pragmatically: we assume that a user who is called a troll by several people is likely to be one. We experiment with different variations of this definition, and in each case we show that we can train a classifier to distinguish a likely troll from a non-troll with very high accuracy, 82–95%, thanks to our rich feature set. }} @article{klinger-cimiano-15-CONLL_instance-selection_for_cross-lingual-sentiment, title={Instance Selection Improves Cross-Lingual Model Training for Fine-Grained Sentiment Analysis}, author={Klinger, Roman and Cimiano, Philipp}, journal={CoNLL 2015}, pages={153}, year={2015}, annote = { Abstract Scarcity of annotated corpora for many languages is a bottleneck for training fine-grained sentiment analysis models that can tag aspects and subjective phrases. We propose to exploit statistical machine translation to alleviate the need for training data by projecting annotated data in a source language to a target language such that a supervised fine-grained sentiment analysis system can be trained. To avoid a negative influence of poor-quality translations, we propose a filtering approach based on machine translation quality estimation measures to select only high-quality sentence pairs for projection. We evaluate on the language pair German/English on a corpus of product reviews annotated for both languages and compare to in-target-language training. Projection without any filtering leads to 23% F1 in the task of detecting aspect phrases, compared to 41% F1 for in-target-language training. Our approach obtains up to 47% F1. Further, we show that the detection of subjective phrases is competitive to in-target-language training without filtering. }} @article{cotterell-muller-15-CONLL_labeled-morphological-segmentation-HMM, title={Labeled Morphological Segmentation with Semi-Markov Models}, author={Cotterell, Ryan and M{\"u}ller, Thomas and Fraser, Alexander and Sch{\"u}tze, Hinrich}, journal={CoNLL 2015}, pages={164}, year={2015}, annote = { Abstract We present labeled morphological segmentation—an alternative view of morphological processing that unifies several tasks. We introduce a new hierarchy of morphotactic tagsets and CHIPMUNK, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. For morphological segmentation our method shows absolute improvements of 2-6 points F1 over a strong baseline.
}} @article{maillard-clark-15-CONLL_learning-adjectives-w-tensor-skip-gram, title={Learning Adjective Meanings with a Tensor-Based Skip-Gram Model}, author={Maillard, Jean and Clark, Stephen}, journal={CoNLL 2015}, pages={327}, year={2015}, annote = { *** combining compositional and distributional semantics. A proposal for the case of adjective-noun combinations is given by Baroni and Zamparelli (2010) (and also Guevara (2010)). Their model represents adjectives as matrices over noun space, trained via linear regression to approximate the “holistic” adjective-noun vectors from the corpus. In this paper we propose a new solution to the problem of learning adjective meaning representations. The model is an implementation of the tensor framework of Coecke et al. (2011), here applied to adjective-noun combinations as a starting point. 2 A tensor-based skip-gram model Our model treats adjectives as linear maps over the vector space of noun meanings, encoded as matrices. The algorithm works in two stages: the first stage learns the noun vectors, as in a standard skip-gram model, and the second stage learns the adjective matrices, given fixed noun vectors. }} @article{yin-W-schuetze-14-ACL_exploration-of-embeddings-for-generalized-phrases, title={An Exploration of Embeddings for Generalized Phrases}, author={Yin, Wenpeng and Sch{\"u}tze, Hinrich}, journal={ACL 2014}, pages={41}, year={2014}, annote = { }} @article{yin-W-schuetze-15-CONLL_multichannel-convolution-for-sentence-vector, title={Multichannel Variable-Size Convolution for Sentence Classification}, author={Yin, Wenpeng and Sch{\"u}tze, Hinrich}, journal={CoNLL 2015}, year={2015}, annote = { Word embeddings are derived by projecting words from a sparse, 1-of-V encoding (V : vocabulary size) onto a lower dimensional and dense vector space via hidden layers and can be interpreted as feature extractors that encode semantic and syntactic features of words. Many papers study the comparative performance of different versions of word embeddings, usually learned by different neural network (NN) architectures. We want to leverage this diversity of different embedding versions to extract higher quality sentence features and thereby improve sentence classification performance. The letters “M” and “V” in the name “MVCNN” of our architecture denote the multichannel and variable-size convolution filters, respectively. “Multichannel” employs language from computer vision where a color image has red, green and blue channels. Here, a channel is a description by an embedding version. In sum, we attribute the success of MVCNN to: (i) designing variable-size convolution filters to extract variable-range features of sentences and (ii) exploring the combination of multiple public embedding versions to initialize words in sentences. We also employ two “tricks” to further enhance system performance: mutual learning and pretraining. Multichannel Input. The input of MVCNN includes multichannel feature maps of a considered sentence, each of which is a matrix initialized by a different embedding version. Let s be the sentence length, d the dimension of the word embeddings, and c the total number of different embedding versions (i.e., channels). Hence, the whole initialized input is a three-dimensional array of size c × d × s. Figure 1 depicts a sentence with s = 12 words. Each word is initialized by c = 5 embeddings, each coming from a different channel.
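(A minimal sketch of assembling the c × d × s multichannel input just described; an assumed helper, not the authors' code. Each embedding version is modeled as a plain word-to-vector dict.)

  import numpy as np

  def multichannel_input(sentence, embedding_versions, d):
      # embedding_versions: list of c dicts, each mapping a word to a d-dim vector.
      # Returns a (c, d, s) array; a word missing from some version is randomly
      # initialized, mirroring the unknown-word handling mentioned in the paper.
      words = sentence.split()
      s, c = len(words), len(embedding_versions)
      x = np.empty((c, d, s))
      for i, emb in enumerate(embedding_versions):
          for j, w in enumerate(words):
              x[i, :, j] = emb.get(w, np.random.randn(d) * 0.01)
      return x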
In implementation, sentences in a mini-batch will be padded to the same length, and unknown words for corresponding channel are randomly initialized or can acquire good initialization from the mutual-learning phase described in next section. }} @article{weegar-astrom-15-CONLL_linking-entities-across-images-and-text, title={Linking Entities Across Images and Text}, author={Weegar, Rebecka and Astr{\"o}m, Kalle and Nugues, Pierre}, journal={CoNLL 2015}, pages={185}, year={2015}, annote = { *** get IAPR TC-12 dataset Figure 1 [image of clouds and grass etc. shows an example of an image from the Segmented and Annotated IAPR TC-12 data set (Escalantea et al., 2010). It has four regions labeled cloud, grass, hill, and river, and the caption: a flat landscape with a dry meadow in the foreground, a lagoon behind it and many clouds in the sky Karpathy et al. (2014) proposed a system for retrieving related images and sentences. They used neural networks and they show that the results are improved if image objects and sentence fragments are included in the model. Sentence fragments are extracted from dependency graphs, where each edge in the graphs corresponds to a fragment. Abstract This paper describes a set of methods to link entities across images and text. As a corpus, we used a data set of images, where each image is commented by a short caption and where the regions in the images are manually segmented and labeled with a category. We extracted the entity mentions from the captions and we computed a semantic similarity between the mentions and the region labels. We also measured the statistical associations between these mentions and the labels and we combined them with the semantic similarity to produce mappings in the form of pairs consisting of a region label and a caption entity. In a second step, we used the syntactic relationships between the mentions and the spatial relationships between the regions to rerank the lists of candidate mappings. To evaluate our methods, we annotated a test set of 200 images, where we manually linked the image regions to their corresponding mentions in the captions. Eventually, we could match objects in pictures to their correct mentions for nearly 89 percent of the segments, when such a matching exists. }} @article{chatterji-sarkar-basu-14_dependency-annotation_bangla-treebank, title={A dependency annotation scheme for Bangla treebank}, author={Chatterji, Sanjay and Sarkar, Tanaya Mukherjee and Dhang, Pragati and Deb, Samhita and Sarkar, Sudeshna and Chakraborty, Jayshree and Basu, Anupam}, journal={Language Resources and Evaluation}, volume={48}, number={3}, pages={443--477}, year={2014}, publisher={Springer}, annote = { }} @inproceedings{al-Badrashiny-eskander-14_automatic-transliteration_romanized-arabic, title={Automatic transliteration of romanized dialectal arabic}, author={Al-Badrashiny, Mohamed and Eskander, Ramy and Habash, Nizar and Rambow, Owen}, booktitle={Proceedings of the Eighteenth Conference on Computational Natural Language Learning}, pages={30--38}, year={2014}, annote = { }} @article{levy-goldberg-14_linguistic-regularities-in-sparse-explicit-word-reprs, title={Linguistic regularities in sparse and explicit word representations}, author={Levy, Omer and Goldberg, Yoav and Ramat-Gan, Israel}, journal={CoNLL-2014}, pages={171}, year={2014}, annote = { Recent work has shown that neural-embedded word representations capture many relational similarities, which can be recovered by means of vector arithmetic in the embedded space. 
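(A minimal sketch of the vector-arithmetic analogy recovery discussed in this annotation, i.e., the additive objective analyzed in the paper; vecs and vocab are assumed inputs with unit-normalized rows.)

  import numpy as np

  def analogy(a, a_star, b, vecs, vocab):
      # Solve "a is to a_star as b is to ?" by maximizing cos(b*, a_star - a + b),
      # excluding the three query words themselves.
      idx = {w: i for i, w in enumerate(vocab)}
      target = vecs[idx[a_star]] - vecs[idx[a]] + vecs[idx[b]]
      scores = vecs @ (target / np.linalg.norm(target))
      for w in (a, a_star, b):
          scores[idx[w]] = -np.inf
      return vocab[int(np.argmax(scores))]

  # e.g. analogy("man", "king", "woman", vecs, vocab) would ideally return "queen".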
We show that Mikolov et al.’s method of first adding and subtracting word vectors, and then searching for a word similar to the result, is equivalent to searching for a word that maximizes a linear combination of three pairwise word similarities. Based on this observation, we suggest an improved method of recovering relational similarities, improving the state-of-the-art results on two recent word-analogy datasets. Moreover, we demonstrate that analogy recovery is not restricted to neural word embeddings, and that a similar amount of relational similarities can be recovered from traditional distributional word representations related work with datasets and code: https://levyomer.wordpress.com/2015/06/23/learning-to-exploit-structured-resources-for-lexical-inference/ }} @inproceedings{kulkarni-alRfou-perozzi-15_statistically-significant-linguistic-change, title={Statistically Significant Detection of Linguistic Change}, author={Kulkarni, Vivek and Al-Rfou, Rami and Perozzi, Bryan and Skiena, Steven}, booktitle={Proceedings of the 24th International Conference on World Wide Web}, pages={625--635}, year={2015}, organization={International World Wide Web Conferences Steering Committee}, annote = { We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time. We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium. }} @article{alRfou-kulkarni-14_polyglot-NER-massive-multilingual-named-entity-recognition, title={POLYGLOT-NER: Massive Multilingual Named Entity Recognition}, author={Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven}, journal={SIAM Intl Conf Data Mining SDM 2015}, year={2014}, annote = { also: arXiv preprint arXiv:1410.3791 We build a Named Entity Recognition system (NER) for 40 languages using only language agnostic methods. Our system relies only on un-supervised methods for feature generation. We obtain training data for the task of NER through a semi-supervised technique not relying whatsoever on any language specific or orthographic features. 
This approach allows us to scale to large set of languages for which little human expertise and human annotated training data is available }} @incollection{perozzi-alRfou-14_inducing-language-networks-from-continuous-word-reprs, title={Inducing language networks from continuous space word representations}, author={Perozzi, Bryan and Al-Rfou, Rami and Kulkarni, Vivek and Skiena, Steven}, booktitle={Complex Networks V}, pages={261--273}, year={2014}, publisher={Springer}, annote = { We induced networks on continuous space representations of words over the Polyglot and Skipgram models. We compared the structural properties of these networks and demonstrate that these networks differ from networks constructed through other run of the mill methods. We also demonstrated that these networks exhibit a rich and varied community structure. }} @article{xiao-guo-15-CONLL_annotation-projection-for-cross-lingual-parsing, title={Annotation Projection-based Representation Learning for Cross-lingual Dependency Parsing}, author={Xiao, Min and Guo, Yuhong}, journal={CoNLL 2015}, pages={73}, year={2015}, annote = { Abstract Cross-lingual dependency parsing aims to train a dependency parser for an annotation-scarce target language by exploiting annotated training data from an annotation-rich source language, which is of great importance in the field of natural language processing. In this paper, we propose to address cross-lingual dependency parsing by inducing latent crosslingual data representations via matrix completion and annotation projections on a large amount of unlabeled parallel sentences. To evaluate the proposed learning technique, we conduct experiments on a set of cross-lingual dependency parsing tasks with nine different languages. The experimental results demonstrate the efficacy of the proposed learning method for cross-lingual dependency parsing. }} @article{bhatt-semwal-15-CONLL_iterative-similarity_cross-domain-text-classification, title={An Iterative Similarity based Adaptation Technique for Cross Domain Text Classification}, author={Bhatt, Himanshu S and Semwal, Deepali and Roy, Shourya}, journal={CoNLL 2015}, pages={52}, year={2015}, annote = { }} @article{lim-T-soon-LK-14_lexicon+-tx_rapid-multilingual-lexicon-for-under-resourced-lgs, title={Lexicon+ TX: rapid construction of a multilingual lexicon with under-resourced languages}, author={Lim, Lian Tze and Soon, Lay-Ki and Lim, Tek Yong and Tang, Enya Kong and Ranaivo-Malan{\c{c}}on, Bali}, journal={Language Resources and Evaluation}, volume={48}, number={3}, pages={479--492}, year={2014}, publisher={Springer}, annote = { }} @inproceedings{bhatt-narasimhan-misra-09_multi-representational-treebank-for-hindi-urdu, title={A multi-representational and multi-layered treebank for hindi/urdu}, author={Bhatt, Rajesh and Narasimhan, Bhuvana and Palmer, Martha and Rambow, Owen and Sharma, Dipti Misra and Xia, Fei}, booktitle={Proceedings of the Third Linguistic Annotation Workshop}, pages={186--189}, year={2009}, organization={Association for Computational Linguistics}, annote = { This paper describes the simultaneous development of dependency structure and phrase structure treebanks for Hindi and Urdu, as well as a PropBank. The dependency structure and the PropBank are manually annotated, and then the phrase structure treebank is produced automatically. To ensure successful conversion the development of the guidelines for all three representations are carefully coordinated. 
}} @article{mikolov-le-Q-13_similarities-in-lg-for-machine-translation, title={Exploiting similarities among languages for machine translation}, author={Mikolov, Tomas and Le, Quoc V and Sutskever, Ilya}, journal={arXiv preprint arXiv:1309.4168}, year={2013}, annote = { }}