FAQ Database Discussion Community


Python convert list of multiple words to single words

python,nlp,nltk
I have a list of words, for example: words = ['one','two','three four','five','six seven'] And I am trying to create a new list where each item in the list is just one word, so I would have: words = ['one','two','three','four','five','six','seven'] Would the best thing to do be...
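A minimal sketch of the flattening step, assuming the multi-word entries are whitespace-separated:

```python
# Split any multi-word entries and flatten the result into a single list.
words = ['one', 'two', 'three four', 'five', 'six seven']
flattened = [w for item in words for w in item.split()]
print(flattened)  # ['one', 'two', 'three', 'four', 'five', 'six', 'seven']
```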

Add Features to an Sklearn Classifier

python,machine-learning,nlp,scikit-learn
I'm building an SGDClassifier and using a tfidf transformer. Aside from the features created from tfidf, I'd also like to add additional features like document length or other ratings. How can I add these features to the feature set? Here is how the classifier is constructed in a pipeline: data =...

Implementing Naive Bayes text categorization but I keep getting zeros

python,algorithm,nlp,text-classification,naivebayes
I am using Naive Bayes for text categorization. This is how I created the initial weights for each term in the specified category: term1: number of times term 1 exists / number of documents in categoryA term2: number of times term 2 exists / number of documents in categoryA term3: number of times term 3 exists / number of...

How to remove a custom word pattern from a text using NLTK with Python

python,regex,nlp,nltk,tokenize
I am currently working on a project analyzing the quality of examination paper questions. Here I am using Python 3.4 with NLTK. First I want to take out each question separately from the text. The question paper format is given below. (Q1). What is web 3.0? (Q2). Explain about blogs....

Transform column of strings with word cluster values

r,nlp,cluster-computing
I am doing some basic NLP work in R. I have two data sets and want to replace the words in one with the cluster value of each word from the other. The first data set holds sentences and the second one the cluster value for each word (assume that...

How to not split English into separate letters in the Stanford Chinese Parser

python,nlp,stanford-nlp,segment,chinese-locale
I am using the Stanford Segmenter at http://nlp.stanford.edu/software/segmenter.shtml in Python. Whenever the Chinese segmenter encounters an English word, it splits the word into many characters one by one, but I want to keep the characters together after the segmentation is done. For example: 你好abc我好 currently will become...

Entity extraction API for Android

android,nlp
I have extracted text successfully from an image, but now I have no idea how to extract the name, location, phone, and cell number from the extracted text. Here is some example text that was extracted. Comsats Institute of Information technology,Abbottabad. Dr Usama Ijaz bajwa Assistant Professor Phone:+92 321 6647911 ...
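For the phone numbers specifically, a regular expression is often enough; names and locations generally need a proper NER tool. A rough sketch (the pattern below is an assumption tuned to international-style numbers with spaces, not a general solution):

```python
import re

text = ("Comsats Institute of Information technology, Abbottabad. "
        "Dr Usama Ijaz bajwa Assistant Professor Phone:+92 321 6647911")

# Optional '+', then at least 9 digits, allowing spaces and hyphens inside.
phone_pattern = re.compile(r'\+?\d[\d\s-]{7,}\d')
print(phone_pattern.findall(text))
```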

Navigate an NLTK tree (follow-up)

python,tree,nlp,nltk
I've asked before how I can properly navigate through an NLTK tree. How do I properly navigate through an NLTK tree (or ParentedTree)? I would like to identify a certain leaf with the parent node "VBZ", then I would like to move from there further up the tree and...

StanfordNLP lemmatization cannot handle -ing words

java,nlp,stanford-nlp,stemming,lemmatization
I've been experimenting with the Stanford NLP toolkit and its lemmatization capabilities. I am surprised how it lemmatizes some words. For example: depressing -> depressing depressed -> depressed depresses -> depress It is not able to transform depressing and depressed into the same lemma. Something similar happens with confusing and confused, hopelessly...

NLTK getting dependencies from raw text

python-2.7,nlp,nltk
I need to get dependencies in sentences from raw text using NLTK. As far as I understood, the Stanford parser allows us just to create a tree, but I didn't find out how to get dependencies in sentences from this tree (maybe it's possible, maybe not). So I've started using MaltParser. Here is...

Stanford coreNLP : can a word in a sentence be part of multiple Coreference chains

nlp,stanford-nlp
The question is in the title. Using Stanford's NLP coref module, I am wondering if a given word can be part of multiple coreference chains. Or can it only be part of one chain. Could you give me examples of when this might occur. Similarly, can a word be part...

How Can I Access the Brown Corpus in Java (aka outside of NLTK)

java,nlp,nltk,corpus,tagged-corpus
I'm trying to write a program that makes use of natural language parts-of-speech in Java. I've been searching on Google and haven't found the entire Brown Corpus (or another corpus of tagged words). I keep finding NLTK information, which I'm not interested in. I want to be able to load...

Is it possible to get boost locale boundary analysis to split on apostrophes?

c++,boost,nlp,icu,boost-locale
For example consider the following code: using namespace boost::locale::boundary; boost::locale::generator gen; std::string text = "L'homme qu'on aimait trop."; ssegment_index map(word, text.begin(), text.end(), gen("fr_FR.UTF-8")); for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it) std::cout << "\"" << *it << "\", "; std::cout << std::endl; This outputs: "L'homme", "...

Abbreviation Reference for NLTK Parts of Speech

python,nlp,nltk
I'm using nltk to find the parts of speech for each word in a sentence. It returns abbreviations that I both can't fully intuit and can't find good documentation for. Running: import nltk sample = "There is no spoon." tokenized_words = nltk.word_tokenize(sample) tagged_words = nltk.pos_tag(tokenized_words) print tagged_words Returns: [('There', 'EX'),...

How to deserialize a CoNLL format dependency tree with ClearNLP?

java,nlp,deserialization,parse-tree,clearnlp
Dependency parsing using ClearNLP creates a DEPTree object. I have parsed a large corpus and serialized all the data in CoNLL format (e.g., this ClearNLP page on Google code). But I can't figure out how to deserialize them. ClearNLP provides a DEPTree#toStringCoNLL() method (scroll down this page to see it)....

Way to dump the relations from Freebase?

nlp,semantic-web,freebase,dbpedia,wikidata
I have run through the Google API for Freebase, but it is still confusing. Is there a simple way to dump the relations from Freebase? I want to dump all entity-name pairs with a specific relation (e.g. marry_with, ...), and I also want the Chinese entity names. Should I write MQL to query all entity...

Save and reuse TfidfVectorizer in scikit learn

python,nlp,scikit-learn,pickle,text-mining
I am using TfidfVectorizer in scikit learn to create a matrix from text data. Now I need to save this object for reusing it later. I tried to use pickle, but it gave the following error. loc=open('vectorizer.obj','w') pickle.dump(self.vectorizer,loc) *** TypeError: can't pickle instancemethod objects I tried using joblib in sklearn.externals,...

Choosing correct word for the given string

nlp,stanford-nlp
Suppose the given word is "connnggggggrrrraaatsss" and we need to convert it to congrats. Or, as another example, "looooooovvvvvveeeeee" should be changed to "love". Here the characters in the given words can be repeated any number of times, but the words should be changed to the correct form. We need to...
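A first-pass sketch using a backreference regex to collapse every repeated-character run. Note this is only a heuristic: it also turns legitimate doubles ("hello" -> "helo") into single letters, so a real solution would check candidates against a dictionary:

```python
import re

def collapse_repeats(word):
    # Collapse every run of a repeated character down to a single occurrence.
    return re.sub(r'(.)\1+', r'\1', word)

print(collapse_repeats('connnggggggrrrraaatsss'))  # congrats
print(collapse_repeats('looooooovvvvvveeeeee'))    # love
```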

extracting n grams from huge text

python,performance,nlp,bigdata,text-processing
For example, we have the following text: "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..." I need all possible sections of this text respectively, for one word by one...
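The sections described are word n-grams. A minimal sketch of a generator for one fixed n (for huge inputs you would stream tokens rather than hold the whole text in memory):

```python
def ngrams(text, n):
    # Return all contiguous n-word sections of the text.
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Spark is a framework for writing fast, distributed programs."
print(ngrams(sentence, 2)[:3])
```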

How to define a CAS in database as external resource for an annotator in uimaFIT?

nlp,data-mining,information-retrieval,uima
I am trying to structure my data processing pipeline using uimaFIT as follows: [annotatorA] => [Consumer to dump annotatorA's annotations from CAS into DB] [annotatorB (should take annotatorA's annotations from DB as input)] => [Consumer for annotatorB] The driver code: /* Step 0: Create a reader */ CollectionReader readerInstance= CollectionReaderFactory.createCollectionReader(...

Python NLTK pos_tag not returning the correct part-of-speech tag

python,machine-learning,nlp,nltk,pos-tagger
Having this: text = word_tokenize("The quick brown fox jumps over the lazy dog") And running: nltk.pos_tag(text) I get: [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')] This is incorrect. The tags for quick brown lazy in the sentence should be:...

Natural Language Search (user intent search)

nlp,search-engine,keyword,voice-recognition,naturallyspeaking
I'm trying to build a search engine that allows my users to search with natural language commands, just like Google Now. Except, my search engine is slightly more constrained, in that it is mainly going to be used within an e-commerce site, and allow the users to search for certain...

Getting the maximum common words in R

r,nlp
I have data of the form: ID A1 A2 A3 ... A100 1 john max karl ... kevin 2 kevin bosy lary ... rosy 3 karl lary bosy ... hale . . . 10000 isha john lewis ... dave I want to get one ID for each ID such that...

Text analysis : What after term-document matrix? [closed]

r,machine-learning,nlp,svm,text-mining
I am trying to build predictive models from text data. I built document-term matrix from the text data (unigram and bigram) and built different types of models on that (like svm, random forest, nearest neighbor etc). All the techniques gave decent results, but I want to improve the results. I...

Convert nl string to vector or some numeric equivalent

javascript,string,nlp
I'm trying to convert a string to a numeric equivalent so I can train a neural network to classify the strings. I tried the sum of the ASCII values, but that just results in larger numbers vs smaller numbers. For example, I could have a short string in German and it...

Problems obtaining most informative features with scikit learn?

python,pandas,machine-learning,nlp,scikit-learn
I'm trying to obtain the most informative features from a textual corpus. From this well-answered question I know that this task could be done as follows: def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10): labelid = list(classifier.classes_).index(classlabel) feature_names = vectorizer.get_feature_names() topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:] for coef, feat in topn: print classlabel, feat,...

Fast shell command to remove stop words in a text file

shell,nlp,text-processing
I have a 2GB text file. I am trying to remove frequently occurring English stop words from this file. I have stopwords.txt containing entries like this: a an the for and I What is the fastest method to do this using a shell command such as tr, sed or awk? ...
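Not a shell one-liner, but the filtering logic itself is simple, and a streaming sketch makes it concrete: process one line at a time (so the 2GB file never sits in memory) and test each token against a set for O(1) lookups. The file names follow the question; the generator shape is an assumption:

```python
# Stream the large file line by line, dropping any token found in the stop list.
def remove_stopwords(lines, stopword_set):
    for line in lines:
        yield ' '.join(w for w in line.split() if w.lower() not in stopword_set)

stops = {'a', 'an', 'the', 'for', 'and', 'i'}  # as loaded from stopwords.txt
print(list(remove_stopwords(['the quick fox and a dog'], stops)))
```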

NLP Shift reduce parser is throwing null pointer Exception for Sentiment calculation

nlp,stanford-nlp,sentiment-analysis,shift-reduce
I am trying to find out sentiments using NLP. The version I am using is 3.4.1. I have some junk data to process and it takes around 45 seconds to process using the default PCFG file. Here is the example: String text = "Nm n n 4 n n bkj nun4hmnun Onn...

How to interpret scikit's learn confusion matrix and classification report?

machine-learning,nlp,scikit-learn,svm,confusion-matrix
I have a sentiment analysis task, and for this I'm using this corpus; the opinions have 5 classes (very neg, neg, neu, pos, very pos), from 1 to 5. So I do the classification as follows: from sklearn.feature_extraction.text import TfidfVectorizer import numpy as np tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2,2)) from sklearn.cross_validation...

NLP - Word Representations

machine-learning,nlp,artificial-intelligence
I am working on a word representation algorithm, similar to Word2Vec and GloVe. I have been asked to make it more dynamic, such that new words can be added to the vocabulary, and new documents can be submitted to the program even after the representations (vectors) have been created. The problem is,...

Are word-vector orientations universal?

nlp,word2vec
I have recently been experimenting with Word2Vec and I noticed whilst trawling through forums that a lot of other people are also creating their own vectors from their own databases. This has made me curious as to how vectors look across databases and whether vectors take a universal orientation? I...

POS of WSJ in CONLL format from penn tree bank

nlp
I've got the Penn Treebank CD. How do I convert designated WSJ documents to CoNLL format? The original format is in tree structure. E.g., the CoNLL 2000 shared task (http://www.cnts.ua.ac.be/conll2000/chunking/) used the Treebank. How was this format obtained? Thank you!

Annotating a treebank with lexical information (Head Words) in JAVA

java,nlp,stanford-nlp,lexical-analysis
I have a treebank with syntactic parse tree for each sentence as given below: (S (NP (DT The) (NN government)) (VP (VBZ charges) (SBAR (IN that) (S (PP (IN between) (NP (NNP July) (CD 1971)) (CC and) (NP (NNP July) (CD 1992))) (, ,) (NP (NNP Rostenkowski)) (VP (VBD placed)...

Only ignore stop words for ngram_range=1

python,nlp,scikit-learn
I am using CountVectorizer from sklearn, looking to provide a list of stop words and apply the count vectorizer for an ngram_range of (1,3). From what I can tell, if a word - say "me" - is in the list of stop words, then it doesn't get seen for higher ngrams, i.e.,...

Coreference resolution using Stanford CoreNLP

java,nlp,stanford-nlp
I am new to the Stanford CoreNLP toolkit and trying to use it for a project to resolve coreferences in news texts. In order to use the Stanford CoreNLP coreference system, we would usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition and parsing....

Create Dictionary from Penn Treebank Corpus sample from NLTK?

python,dictionary,nlp,nltk,corpus
I know that the Treebank corpus is already tagged, but unlike the Brown corpus, I can't figure out how to get a dictionary of tags. For instance, >>> from nltk.corpus import brown >>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words()) This doesn't work on the Treebank corpus?...

Amazon Machine Learning for sentiment analysis

amazon-web-services,machine-learning,nlp,sentiment-analysis
How flexible or supportive is the Amazon Machine Learning platform for sentiment analysis and text analytics?

Chinese sentence segmenter with Stanford coreNLP

java,nlp,tokenize,stanford-nlp
I'm using the Stanford coreNLP system with the following command: java -cp stanford-corenlp-3.5.2.jar:stanford-chinese-corenlp-2015-04-20-models.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt And this is working great on small Chinese texts. However, I need to train an MT system which just requires me to segment my input. So I just need...

NLP- Sentiment Processing for Junk Data takes time

nlp,stanford-nlp,sentiment-analysis,pos-tagger
I am trying to find the sentiment for the input text. This text is a junk sentence, and when I tried to find the sentiment, the annotation to parse the sentence took around 30 seconds. For normal text it takes less than a second. If I need to process...

How to extract brand from product name

python,machine-learning,nlp
I have two websites and I have data in hand; now I want to do analysis with that data. I have two product names (Brand + Product name) and I want to extract only the brand name. http://www.thehut.com/jeans-clothing/men/clothing/brave-soul-men-s-cardiff-jeans-denim/10741907.html On the above website the product name is Brave Soul Men's Swansea Jeans -...

Identify prepositions and individual POS

nlp,stanford-nlp
I am trying to find the correct part of speech for each word in a paragraph. I am using the Stanford POS Tagger. However, I am stuck at a point. I want to identify prepositions in the paragraph. The Penn Treebank tagset says: IN Preposition or subordinating conjunction. How can I be sure...

What is the default behavior of Stanford NLP's WordsToSentencesAnnotator when splitting a text into sentences?

nlp,stanford-nlp
Looking at WordToSentenceProcessor.java, DEFAULT_BOUNDARY_REGEX = "\\.|[!?]+"; led me to think that the text would get split into sentences based on ., ! and ?. However, if I pass the string D R E L I N. Okay. as input, e.g. using the command line interface: java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP...

nltk sentence tokenizer, consider new lines as sentence boundary

python,nlp,nltk,tokenize
I am using nltk's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to consider a new paragraph or new lines as a new sentence. >>> from nltk.tokenize.punkt import PunktSentenceTokenizer >>> tokenizer = PunktSentenceTokenizer() >>> tokenizer.tokenize('Sentence 1 \n Sentence 2. Sentence 3.') ['Sentence 1 \n...
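One common workaround is to split on line breaks first and then run the sentence tokenizer on each chunk, so newlines always act as boundaries. A sketch of that wrapper; the trivial period-based splitter below is a stand-in for the PunktSentenceTokenizer in the question, just so the sketch runs on its own:

```python
def naive_sent_tokenize(text):
    # Stand-in for tokenizer.tokenize from nltk's PunktSentenceTokenizer.
    return [s.strip() + '.' for s in text.split('.') if s.strip()]

def tokenize_with_newlines(text, sent_tokenize=naive_sent_tokenize):
    # Treat each line as its own unit, then sentence-split within it.
    sentences = []
    for chunk in text.split('\n'):
        if chunk.strip():
            sentences.extend(sent_tokenize(chunk))
    return sentences

print(tokenize_with_newlines('Sentence 1 \n Sentence 2. Sentence 3.'))
```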

NLP: Arrange words with tags into proper English sentence?

nlp
Let's say I have a sentence: "you hello how are ?" I get the output: you_PRP hello_VBP how_WRB are_VBP What is the best way to arrange the words into a proper English sentence like: Hello how are you ? I am new to this whole natural language processing field, so I am unfamiliar...

NLP - Error while Tokenization and Tagging etc [duplicate]

java,nlp,stanford-nlp
This question already has an answer here: How to fix: Unsupported major.minor version 51.0 error? (30 answers) I want to identify all the tokens and also do part-of-speech tagging using the Stanford NLP jar file. I have added all the required jar files to the build path of the project. The...

How to iterate through the synset list generated from wordnet using python 3.4.2

python-3.x,nlp,wordnet,sentiment-analysis
I am using wordnet to find the synonyms for a particular word as shown below synonyms = wn.synsets('good','a') where wn is wordnet. This returns a list of synsets like Synset('good.a.01') Synset('full.s.06') Synset('good.a.03') Synset('estimable.s.02') Synset('beneficial.s.01') etc... How to iterate through each synset and get the name and the pos tag of...

difference between Latent and Explicit Semantic Analysis

machine-learning,nlp
I'm trying to analyse the paper ''Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis''. One component of the system described therein that I'm currently grappling with is the difference between Latent and Explicit Semantic Analysis. I've been writing up a document to encapsulate my understanding but it's somewhat, "cobbled together",...

Annotator for Relationship Extraction

regex,nlp,nltk,stanford-nlp,gate
I have a set of urls in a text file. For each url in that text file, I want to tag the entities and relationships in the text contained in that url. I am aware of the entity taggers like Stanford NER, NLTK and GATE which can perform the entity...

lingpipe sentiment analysis tutorial demo error?

java,eclipse,nlp,lingpipe
I was doing the sentiment analysis tutorial from the LingPipe website, and I keep getting this error; can anyone help? java -cp "sentimentDemo.jar:../../../lingpipe-4.1.0.jar" PolarityBasic file:///Users/dylan/Desktop/POLARITY_DIR/ BASIC POLARITY DEMO Data Directory=file:/Users/dylan/Desktop/POLARITY_DIR/txt_sentoken Thrown: java.lang.NullPointerException java.lang.NullPointerException at com.aliasi.classify.DynamicLMClassifier.createNGramProcess(DynamicLMClassifier.java:313) at...

Stanford NLP: Chinese Part of Speech labels?

python,nlp,stanford-nlp,pos-tagger,part-of-speech
I am trying to find a table explaining each label in the Chinese part-of-speech tagger for the 2015.1.30 version. I couldn't find anything on this topic. The closest thing I could find was in the "Morphological features help POS tagging of unknown words across language varieties" article, but it doesn't...

stemming words in python

python,nlp,stemming
I'm using this code to stem words. Here is how it works: first there's a list of suffixes; the program checks if the word's ending matches one in the list, and if so it removes the suffix. However, when I run the code I get this result:...

TF-IDF vs only IDF

nlp,ranking,tf-idf
Is there any case where IDF is better than TF-IDF? As far as I understood, TF is important to give a weight to a word within a document to match that document with a predefined query. If I'd like just to sort the importance of words in a collection of documents...

The ± 2 window in Word similarity of NLP

vector,nlp,distribution
There is a question illustrated below: //--------question start--------------------- Consider the following small corpus consisting of three sentences: The judge struck the gavel to silence the court. Buying the cheap saw is false economy. The nail was driven in when the hammer struck it hard. Use distributional similarity to determine whether...
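The "± 2 window" simply means that, for each occurrence of a target word, you collect the words up to two positions on either side as its context features. A minimal sketch of that collection step:

```python
def context_window(tokens, target, k=2):
    # Collect the words within k positions of each occurrence of target.
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            contexts.extend(tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k])
    return contexts

sent = "the nail was driven in when the hammer struck it hard".split()
print(context_window(sent, 'struck'))  # ['the', 'hammer', 'it', 'hard']
```

Distributional similarity then compares these context sets (or count vectors) across the words being tested.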

Mecab output - list of name types

nlp,translation,pos-tagger,mecab
A sample output from mecab: に ニ ニ に 助詞-格助詞 We have 助詞 (particle) as the type and 格助詞 (case-marking particle) as the PoS. Where can I find a list of all possible types and PoS tags that mecab uses? I want to be able to map the Japanese to a translated...

How to replace a word by its most representative mention using Stanford CoreNLP Coreferences module

java,nlp,stanford-nlp
I am trying to figure out the way to rewrite sentences by "resolving" (replacing words with) their coreferences using Stanford Corenlp's Coreference module. The idea is to rewrite a sentence like the following : John drove to Judy’s house. He made her dinner. into John drove to Judy’s house. John...

Separately tokenizing and pos-tagging with CoreNLP

java,nlp,stanford-nlp
I'm having a few problems with the way Stanford CoreNLP divides text into sentences, namely: It treats ! and ? (exclamation and question marks) inside quoted text as a sentence end where it shouldn't, e.g.: He shouted "Alice! Alice!" - here it treats the ! after the first Alice as...

Get noun from verb Wordnet

python,nlp,wordnet
I'm trying to get the noun from a verb with Wordnet in python. I want to be able to get : from the verb 'created' the noun 'creator', 'funded' => 'funder' Verb X => Noun Y Y is referring to a person I've been able to do it the other...

Documentation of Moses (statistical machine translation) moses.ini file format?

machine-learning,nlp,machine-translation,moses
Is there any documentation of the moses.ini format for Moses? Running moses at the command line without arguments returns available feature names but not their available arguments. Additionally, the structure of the .ini file is not specified in the manual that I can see.

Stanford CoreNLP wrong coreference resolution

nlp,stanford-nlp
I am still playing with Stanford's CoreNLP and I am encountering strange results on a very trivial test of Coreference resolution. Given the two sentences : The hotel had a big bathroom. It was very clean. I would expect "It" in sentence 2 to be coreferenced by "bathroom" or at...

How to un-stem a word in Python?

python,nlp,nltk
I want to know if there is any way that I can un-stem words to a normal form? The problem is that I have thousands of words in different forms, e.g. eat, eaten, ate, eating and so on, and I need to count the frequency of each word. All of these...
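One common approach is to avoid true un-stemming: group the original surface forms by their stem, sum the counts, and report the most frequent surface form as the group's representative. A sketch of that bookkeeping; the toy stemmer here is a placeholder (a real pipeline would use something like NLTK's PorterStemmer):

```python
from collections import Counter, defaultdict

def unstem_counts(words, stem):
    # Group surface forms by stem, then report each stem's total count and
    # its most frequent original form. `stem` is any stemming function.
    groups = defaultdict(Counter)
    for w in words:
        groups[stem(w)][w] += 1
    return {s: (sum(c.values()), c.most_common(1)[0][0])
            for s, c in groups.items()}

toy_stem = lambda w: w[:3]  # placeholder stemmer, for the demo only
print(unstem_counts(['eat', 'eaten', 'eat', 'eating'], toy_stem))
```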

Tabulating characters with diacritics in R

r,unicode,nlp,linguistics
I'm trying to tabulate phones (characters) occurrences in a string, but diacritics are tabulated as characters on their own. Ideally, I have a wordlist in International Phonetic Alphabet, with a fair amount of diacritics and several combinations of them with base characters. I give here a MWE with just one...

NLTK fcfg sem value is awkward

python,nlp,nltk,context-free-grammar
My FCFG that I used for this sentence was S[SEM=<?vp(?np)>] -> NP[NUM=?n, SEM=?np] VP[NUM=?n,SEM=?vp] VP[NUM=?n,SEM=<?v(?obj)>] -> TV[NUM=?n,SEM=?v] DET NP[SEM=?obj NP[NUM=?n, SEM=?np] -> N[NUM=?n, SEM=?np] N[NUM=sg, SEM=<\P.P(I)>] -> 'I' TV[NUM=sg,SEM=<\x y.(run(y,x))>] -> 'run' DET -> "a" N[NUM=sg, SEM=<\P.P(race)>] -> 'race' I want to parse out the sentence "I run a race"...

software to extract word functions like subject, predicate, object etc

nlp,stanford-nlp
I need to extract the relations of the words in a sentence. I'm mostly interested in identifying the subject, predicate and object. For example, for the following sentence: She gave him a pen I'd like to have: She_subject gave_predicate him a pen_object. Can Stanford NLP do that? I've tried...

How to Break a sentence into a few words

python-2.7,parsing,nlp,nltk
I want to ask how to break a sentence into individual words. What is this called in NLP (Natural Language Processing), and in Python is it done with NLTK or with a parser?

Python NLTK - Making a 'Dictionary' from a Corpus and Saving the Number Tags

python,nlp,nltk,corpus,tagged-corpus
I'm not super experienced with Python, but I want to do some Data analytics with a corpus, so I'm doing that part in NLTK Python. I want to go through the entire corpus and make a dictionary containing every word that appears in the corpus dataset. I want to be...

filtering stopwords near punctuation

python,nlp,nltk
I am trying to filter out stopwords in my text like so: clean = ' '.join([word for word in text.split() if word not in (stopwords)]) The problem is that text.split() has elements like 'word.' that don't match to the stopword 'word'. I later use clean in sent_tokenize(clean), however, so I...
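A small fix to the list comprehension in the question: strip surrounding punctuation (and lowercase) before the stop-word lookup, so tokens like 'word.' are compared as 'word'. A sketch with an assumed tiny stop list:

```python
import string

text = 'This is a word. And another word!'
stops = {'this', 'is', 'a', 'and', 'another'}  # assumed sample stop list

# Strip surrounding punctuation before the stop-word test; the original
# token (punctuation included) is what gets kept in the output.
clean = ' '.join(w for w in text.split()
                 if w.strip(string.punctuation).lower() not in stops)
print(clean)  # word. word!
```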

How to properly navigate an NLTK parse tree?

python,tree,nlp,nltk
NLTK is driving me nuts again. How do I properly navigate through an NLTK tree (or ParentedTree)? I would like to identify a certain leaf with the parent node "VBZ", then I would like to move from there further up the tree and to the left to identify the NP...

Determination of human language from text: system structure [closed]

java,nlp,language-detection
I'm using these word lists. Right now I'm only thinking about German, Russian, English, and French. I guess what I'm going to do is put them all as part of a hashmap, one for each language with the word as the key, and a boolean as the value. When I...

How to Identify mentions in a text?

nlp,stanford-nlp
I am looking for rule-based methods or any other methods to identify all mentions in a text. I have found several libraries that give coreferences but no exact options for only mentions. What I want is something like below: Input text: [This painter]'s indulgence of visual fantasy, and appreciation of...

How to use serialized CRFClassifier with StanfordCoreNLP prop 'ner'

java,nlp,stanford-nlp
I'm using the StanfordCoreNLP API interface to programmatically do some basic NLP. I need to train a model on my own corpus, but I'd like to use the StanfordCoreNLP interface to do it, because it handles a lot of the dry mechanics behind the scenes and I don't need much...

Stanford NLP - Using Parsed or Tagged text to generate Full XML

parsing,nlp,stanford-nlp,pos-tagging
I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged. i.e. wsj_DDXX.mrg and wsj_DDXX.pos files. I would like to use the already parsed trees and tagged data in these files so as...

What exactly is the difference between AnalysisEngine and CAS Consumer?

nlp,uima
I'm learning UIMA, and I can create basic analysis engines and get results. But what I'm finding difficult to understand is the use of CAS Consumers. At the same time I want to know how different one is from an AnalysisEngine. From many examples I have seen, a CAS Consumer is not...

One Hot Encoding for representing corpus sentences in python

python,machine-learning,nlp,scikit-learn,one-hot
I am a beginner with Python and the scikit-learn library. I currently need to work on an NLP project which first needs to represent a large corpus by one-hot encoding. I have read scikit-learn's documentation about preprocessing.OneHotEncoder; however, it does not seem to match my understanding of the term. Basically,...
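For sentences, the usual encoding is a binary presence vector over the vocabulary (strictly a "multi-hot" bag-of-words; one-hot in the narrow sense applies to single tokens). A dependency-free sketch of that idea:

```python
# Build a vocabulary index, then map each sentence to a binary presence vector.
corpus = ['the cat sat', 'the dog sat']
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(sentence):
    vec = [0] * len(vocab)
    for w in sentence.split():
        vec[index[w]] = 1
    return vec

print(vocab)                   # ['cat', 'dog', 'sat', 'the']
print(one_hot('the cat sat'))  # [1, 0, 1, 1]
```

In scikit-learn, CountVectorizer(binary=True) produces the same kind of representation.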

Stanford Parser - Factored model and PCFG

parsing,nlp,stanford-nlp,sentiment-analysis,text-analysis
What is the difference between the factored and PCFG models of the Stanford parser, in terms of theoretical working and mathematical perspective?

Parsing multiple sentences with MaltParser using NLTK

java,python,parsing,nlp,nltk
There have been many MaltParser and/or NLTK related questions: Malt Parser throwing class not found exception How to use malt parser in python nltk MaltParser Not Working in Python NLTK NLTK MaltParser won't parse Dependency parser using NLTK and MaltParser Dependency Parsing using MaltParser and NLTK Parsing with MaltParser engmalt...

Data Mining and Text Mining

nlp,bigdata,nltk,data-mining,text-mining
What is the difference between data mining and text mining? Both refer to the extraction of unstructured data into structured form. Do both work in the same fashion? Please provide clarity on that.

term frequency over time: how to plot +200 graphs in one plot with Python/pandas/matplotlib?

python,matplotlib,plot,nlp
I am conducting a textual content analysis of several web blogs, and now focusing on finding emerging trends. In order to do so for one blog, I coded a multi-step process: looping over all the posts, finding the top 5 keywords in each post adding them to a list, if...

How to handle slang words and short forms in Tweets like luv , kool and brb?

twitter,nlp
I am doing preprocessing of tweets using Python. However, a lot of words used are short forms of other words like luv, kool etc. And also, abbreviations like brb , ttyl etc. Right now, I can only think of having a huge Hashmap with words as keys and the actual...
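The hashmap idea from the question works fine as a normalization dictionary: look each token up and fall back to the token itself when it is unknown. A minimal sketch (the mappings are assumed examples):

```python
# A lookup table of known slang/abbreviations; unknown tokens pass through.
slang = {'luv': 'love', 'kool': 'cool', 'brb': 'be right back',
         'ttyl': 'talk to you later'}

def normalize(tweet):
    return ' '.join(slang.get(w.lower(), w) for w in tweet.split())

print(normalize('luv this kool song, brb'))  # love this cool song, be right back
```

Curated resources such as internet-slang dictionaries can seed the table; anything not covered still needs spell-correction or statistical normalization.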

Stanford Entity Recognizer (caseless) in Python Nltk

python,nlp,nltk
I am trying to figure out how to use the caseless version of the entity recognizer from NLTK. I downloaded http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip and placed it in the site-packages folder of python. Then I downloaded http://nlp.stanford.edu/software/stanford-corenlp-caseless-2015-04-20-models.jar and placed it in the folder. Then I ran this code in NLTK from nltk.tag.stanford import...

Lemmatizer supporting german language (for commercial and research purpose)

machine-learning,nlp,linguistics
I am searching for lemmatization software which: supports the German language; has a license that allows it to be used for commercial and research purposes (an LGPL license would be good); and should preferably be implemented in Java, though implementations in other programming languages would also be OK. Does anybody know about...

Handling count of characters with diacritics in R

r,unicode,character-encoding,nlp,linguistics
I'm trying to get the number of characters in strings with characters with diacritics, but I can't manage to get the right result. > x <- "n̥ala" > nchar(x) [1] 5 What I want to get is is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be...
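The question is about R, but the underlying fix is the same in any language: decompose the string and count only non-combining code points, so a base letter plus its diacritic counts as one character. An equivalent sketch in Python's stdlib, shown here because it makes the Unicode logic explicit:

```python
import unicodedata

def count_base_chars(s):
    # Count only non-combining characters after NFD decomposition, so a base
    # letter plus its diacritic (e.g. "n" + COMBINING RING BELOW) counts once.
    nfd = unicodedata.normalize('NFD', s)
    return sum(1 for ch in nfd if not unicodedata.combining(ch))

print(count_base_chars('n\u0325ala'))  # 4
```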

What is the most efficient way of storing language models in NLP applications?

nlp,n-gram,language-model
How do they usually store and update language models (such as N-gram models)? What kind of structure is the most efficient way for storing these models in databases?
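Production systems typically use compact tries or specialized tools (KenLM, SRILM) with quantized probabilities; but the basic in-memory layout most toy implementations start from is a map from context to next-word counts, which is cheap to update incrementally. A sketch of that structure for trigrams:

```python
from collections import defaultdict

# Context tuple -> counts of the word that follows it.
counts = defaultdict(lambda: defaultdict(int))

def add_trigrams(tokens):
    for i in range(len(tokens) - 2):
        counts[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1

add_trigrams('the cat sat on the mat'.split())
print(dict(counts[('the', 'cat')]))  # {'sat': 1}
```

For database-backed storage the same shape maps to a key-value store keyed on the serialized context.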

define CRF++ template file

c++,perl,nlp
This is my issue, but it doesn't say HOW to define the template file correctly. My training file looks like this: 上 B-NR 海 L-NR 浦 B-NR 东 L-NR 开 B-NN 发 L-NN 与 U-CC 法 B-NN 制 L-NN 建 B-NN ... ...

PHP: Translate in natural language a weekly calendar availability?

javascript,php,algorithm,nlp
I have in my DB users' weekly availability stored like Monday Morning - yes Monday Afternoon - yes Monday Night - NO Tuesday Morning - yes Tuesday Afternoon - yes Tuesday Night - NO Wednesday Morning - yes Wednesday Afternoon - yes Wednesday Night - NO etc. basically is a...

Where can I find a corpus of search engine queries?

nlp,search-engine,google-search,bing
I'm interested in training a question-answering system on top of user-generated search queries but so far it looks like such data is not made available. Are there some research centers or industry labs that have compiled corpora of search-engine queries?

Negation handling in NLP

python,regex,nlp,nltk,text-processing
I'm currently working on a project where I want to extract emotion from text. As I'm using conceptnet5 (a semantic network), I can't simply prefix words in a sentence that contains a negation word, as those prefixed words would not show up in conceptnet5's API. Here's an example: The movie...
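
A common workaround (a generic sketch, not conceptnet5-specific) is to mark tokens in a negation's scope, from the negation word up to the next punctuation, and handle the marked tokens separately:

```python
import re

NEGATORS = {"not", "no", "never", "cannot"}

def mark_negation(text):
    """Prefix every token after a negation word with NOT_, up to the next punctuation."""
    out, negated = [], False
    for token in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if token in NEGATORS or token.endswith("n't"):
            negated = True
            out.append(token)
        elif token in ".,!?;":
            negated = False  # punctuation closes the negation scope
            out.append(token)
        else:
            out.append("NOT_" + token if negated else token)
    return out

print(mark_negation("The movie was not good, but fun."))
```

The unmarked tokens can still be looked up in the API as-is, while NOT_ tokens get their emotion scores inverted or discounted.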

How to suppress unmatched words in Stanford NER classifiers?

nlp,stanford-nlp,named-entity-recognition
I am new to Stanford NLP and NER and trying to train a custom classifier with data sets of currencies and countries. My training data in training-data-currency.tsv looks like - USD CURRENCY GBP CURRENCY And, training data in training-data-countries.tsv looks like - USA COUNTRY UK COUNTRY And, classifiers properties...

Python : How to optimize comparison between two large sets?

python,list,optimization,comparison,nlp
I salute you! I'm new here, and I've got a little problem trying to optimize this part of code. I'm reading from two files: Corpus.txt -----> Contains my text (of 1.000.000 words) Stop_words.txt -----> Contains my stop_list (of 4000 words) I must compare each word from my corpus...
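
The usual optimization here is to load the stop list into a set, so each membership test is O(1) on average instead of a linear scan over 4000 entries; a minimal sketch with inline data standing in for the two files:

```python
# In practice: stop_words = set(open("Stop_words.txt", encoding="utf-8").read().split())
stop_words = {"the", "a", "of", "and"}

corpus = "the quick brown fox and the lazy dog".split()

# Set lookup is O(1) on average; `w not in some_list` would be O(n) per word
content_words = [w for w in corpus if w not in stop_words]
print(content_words)  # ['quick', 'brown', 'fox', 'lazy', 'dog']
```

With a 1,000,000-word corpus this turns roughly 4 billion comparisons into a single hash lookup per word.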

Format an entire text with pattern.en?

python,machine-learning,nlp
I need to analyse some texts for a machine learning purpose. A data scientist I know advised me to use pattern.en for my project. I will give my program a keyword (Example : pizza), and it has to sort some "trends" from several texts I give it. (Example : I...

Annotator dependencies: UIMA Type Capabilities?

java,annotations,nlp,uima,dkpro-core
In my UIMA application, I have some annotators that must run after a certain annotator has run. At first, I thought about aggregating these annotators together, but I have other annotators that are also dependent on this (and other) annotator, which makes aggregating hard and/or impractical. I read about Type...

How can the NamedEntityTag be used as EntityMention in RelationMention in the RelationExtractor?

nlp,stanford-nlp
I'm trying to train my own NamedEntityRecognizer and RelationExtractor. I've managed the NER model, but the integration with the RelationExtractor is a bit tricky. I get the right NamedEntityTags, but the RelationMentions found by the extractor are only one-term and carry no NamedEntity other than the default ones. I got input...

Difference between word sense discovery and word sense induction?

nlp
What is the difference between word sense discovery and word sense induction? I have checked this Wikipedia page. It says that it's strictly related to WSD. So what is the difference between them?

How to split a sentence in Python?

python-3.x,nlp
I need to isolate every single word of a long, natural text in Python 3. What is the most efficient way to do this?...
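
A minimal sketch of the two common approaches: str.split() (whitespace only, punctuation stays attached) and a regex over word characters (punctuation dropped):

```python
import re

text = "Isolate every single word of a long, natural text."

# Whitespace split keeps punctuation glued to words ('long,', 'text.')
print(text.split())

# A regex over word characters returns clean words only
print(re.findall(r"\w+", text))
```

For very large texts, re.finditer avoids materializing the whole list at once.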

Counting words in list using a dictionary

python,nlp
I have a list of dictionaries containing a word and some misspellings of the word. I am trying to go through a list of strings and first count the occurrences of the word and then count the occurrences of each misspelling. I have tried using if word in string...
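
One way to structure this (the lexicon and texts below are hypothetical) is to tokenize each string and count the canonical word and each misspelling separately, so "receive" never matches inside "recieve":

```python
from collections import Counter

lexicon = [
    {"word": "receive", "misspellings": ["recieve", "receeve"]},
    {"word": "separate", "misspellings": ["seperate"]},
]

texts = ["i recieve mail", "please separate these", "we receive and seperate"]

counts = Counter()
for text in texts:
    tokens = text.split()  # whole-token matching, unlike `word in string`
    for entry in lexicon:
        counts[entry["word"]] += tokens.count(entry["word"])
        for wrong in entry["misspellings"]:
            counts[wrong] += tokens.count(wrong)

print(counts)
```

Substring tests like `if word in string` over-count here, since "separate" is a substring of nothing above but e.g. "ate" would match inside "separate".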

How is Apache UIMA different from Apache OpenNLP?

nlp,opennlp,uima
I have been doing some capability testing with Apache OpenNLP, which has capabilities for sentence detection, tokenization, and named-entity recognition. Now when I started looking at the UIMA documents, it is mentioned on the UIMA home page - "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity...

Using word2vec to calculate similarity between users

nlp,recommendation-engine,mahout-recommender,word2vec
I recently came to know about this tool called word2vec. For my current work, I need to find out users that are similar to a given user. A single user has entities associated with it like age, qualifications, institutes/organisations, languages known, and a list of certain tags. If we consider a...
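
Before reaching for word2vec, a plain cosine similarity over bags of user attributes is a useful baseline; word2vec embeddings could later replace the one-hot weights below. The user profiles here are hypothetical:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two sparse vectors represented as dicts."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each user as a bag of attribute tokens (tags, qualifications, languages...)
u1 = {"python": 1.0, "nlp": 1.0, "phd": 1.0}
u2 = {"python": 1.0, "nlp": 1.0, "msc": 1.0}
print(round(cosine(u1, u2), 3))  # 0.667
```

Swapping the 1.0 weights for averaged word2vec vectors of the attribute words would let "msc" and "phd" contribute partial similarity instead of zero.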

Performing semantic analysis in text

nlp,semantic-web
I want to perform semantic analysis on some text similar to YAGO[1]. But I have no structure in the text to identify entities and relationships. One way is I use POS tagging and then identify subject and predicates in the sentences[2]. But still I cannot establish what relationships exist between...

CoreNLP API for N-grams?

nlp,stanford-nlp,n-gram,pos-tagger
Does CoreNLP have an API for getting unigrams, bigrams, trigrams, etc.? For example, I have a string "I have the best car ". I would love to get: I I have the the best car based on the string I am passing....

How to extract derivation rules from a bracketed parse tree?

java,parsing,recursion,nlp,pseudocode
I have a lot of parse trees like this: ( S ( NP-SBJ ( PRP I ) ) ( @S ( VP ( VBP have ) ( NP ( DT a ) ( @NP ( NN savings ) ( NN account ) ) ) ) ( . . ) )...
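
A sketch of the recursion (shown on a simplified tree): tokenize the brackets, build the tree, then emit one LABEL -> child-labels production per internal node:

```python
def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Consume one ( LABEL child... ) group into (label, children), or a leaf word."""
    tok = tokens.pop(0)
    if tok != "(":
        return tok  # leaf: a terminal word
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        children.append(parse(tokens))
    tokens.pop(0)  # drop the closing ")"
    return (label, children)

def rules(tree, out):
    """Append one production per internal node, recursing into subtrees."""
    if isinstance(tree, str):
        return
    label, children = tree
    rhs = [c if isinstance(c, str) else c[0] for c in children]
    out.append(f"{label} -> {' '.join(rhs)}")
    for c in children:
        rules(c, out)

tree = parse(tokenize("( S ( NP ( PRP I ) ) ( VP ( VBP have ) ) )"))
prods = []
rules(tree, prods)
print(prods)  # ['S -> NP VP', 'NP -> PRP', 'PRP -> I', 'VP -> VBP', 'VBP -have' ... ]
```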

dynamically populate hashmap with human language dictionary for text analysis

java,dictionary,hashmap,nlp
I'm writing a software project to take as input a text in human language and determine what language it's written in. My idea is that I'm going to store dictionaries in hashmaps, with the word as a key and a bool as a value. If the document has that word...
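
A sketch of the idea in Python, where a set plays the role of the word -> bool hashmap; the mini-dictionaries are hypothetical stand-ins for real word lists loaded from files:

```python
# Hypothetical mini-dictionaries; in practice, load each from a word-list file
DICTIONARIES = {
    "english": {"the", "and", "house", "is"},
    "german": {"der", "und", "haus", "ist"},
}

def guess_language(text):
    """Score each language by how many input tokens its dictionary contains."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in DICTIONARIES.items()}
    return max(scores, key=scores.get)

print(guess_language("the house is big"))   # english
print(guess_language("der haus ist gross")) # german
```

Since only presence matters, a set (or a HashMap with a dummy value in Java) is enough; storing a bool per word adds nothing.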