
Lemmatizer supporting the German language (for commercial and research purposes)


Tags: machine-learning, nlp, linguistics

I am searching for lemmatization software which:

Does anybody know of such a lemmatizer?



LanguageTool can do that (disclaimer: I'm the maintainer of LanguageTool). It's available under the LGPL and implemented in Java. You could use GermanTagger.tag(); the result can have more than one reading (as language is often ambiguous), and each reading's AnalyzedToken finally has a lemma.
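A minimal sketch of that call chain, assuming LanguageTool is on the classpath (class and method names as found in the org.languagetool packages; exact packages and signatures may differ between versions):

```java
import java.util.Arrays;
import java.util.List;

import org.languagetool.AnalyzedToken;
import org.languagetool.AnalyzedTokenReadings;
import org.languagetool.tagging.de.GermanTagger;

public class LemmaDemo {
    public static void main(String[] args) throws Exception {
        GermanTagger tagger = new GermanTagger();
        // tag() takes a list of word tokens and returns one
        // AnalyzedTokenReadings object per token.
        List<AnalyzedTokenReadings> tagged =
                tagger.tag(Arrays.asList("Häuser", "liefen"));
        for (AnalyzedTokenReadings tokenReadings : tagged) {
            // A token can be ambiguous, so inspect every reading.
            for (AnalyzedToken reading : tokenReadings.getReadings()) {
                System.out.println(reading.getToken() + " -> " + reading.getLemma());
            }
        }
    }
}
```

For ambiguous forms you may get several lemma lines per token, so pick a reading (or keep all) depending on your application.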


Dimension Reduction of Features in Machine Learning

Is there any way to reduce the dimension of the following features from 2D coordinate (x,y) to one dimension? ...

Annotator dependencies: UIMA Type Capabilities?

In my UIMA application, I have some annotators that must run after a certain annotator has run. At first, I thought about aggregating these annotators together, but I have other annotators that are also dependent on this (and other) annotator, which makes aggregating hard and/or impractical. I read about Type...

Separately tokenizing and pos-tagging with CoreNLP

I'm having a few problems with the way Stanford CoreNLP divides text into sentences, namely: it treats ! and ? (exclamation and question marks) inside quoted text as a sentence end where it shouldn't, e.g.: He shouted "Alice! Alice!" - here it treats the ! after the first Alice as...

Stanford Entity Recognizer (caseless) in Python Nltk

I am trying to figure out how to use the caseless version of the entity recognizer from NLTK. I downloaded and placed it in the site-packages folder of python. Then I downloaded and placed it in the folder. Then I ran this code in NLTK from nltk.tag.stanford import...

How to avoid the error “TypeError: invalid data type for einsum” in Python

I am trying to load a CSV file into a numpy array and use the array in LogisticRegression etc. Now I am struggling with the error shown below: import numpy as np import pandas as pd from sklearn import preprocessing from sklearn.linear_model import LogisticRegression dataset = pd.read_csv('../Bookie_test.csv').values X = dataset[1:, 32:34] y = dataset[1:,...
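The usual cause of that einsum error is that `DataFrame.values` yields an object-dtype array when any column contains strings, and scikit-learn's linear algebra chokes on it. A hedged sketch of the fix, assuming the selected slice is purely numeric, is to cast explicitly (toy data below, not the asker's CSV):

```python
import numpy as np

# Toy stand-in for dataset = pd.read_csv(...).values: mixing strings
# and numbers gives an object-dtype array.
dataset = np.array([["id", "3.0", "4.0"],
                    ["a",  "1.5", "2.5"]], dtype=object)

# Casting the numeric slice to float gives scikit-learn a clean array.
X = dataset[1:, 1:3].astype(np.float64)
print(X.dtype)  # float64
```

If the cast fails, some cell in the slice is genuinely non-numeric and needs cleaning first.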

Prediction based on large texts using Vowpal Wabbit

I want to use the resolution time in minutes and the client description of the tickets on Zendesk to predict the resolution time of the next tickets based on their description. I will use only these two values, but the description is a large text. I searched about hashing the feature...

Save and reuse TfidfVectorizer in scikit learn

I am using TfidfVectorizer in scikit learn to create a matrix from text data. Now I need to save this object for reusing it later. I tried to use pickle, but it gave the following error. loc=open('vectorizer.obj','w') pickle.dump(self.vectorizer,loc) *** TypeError: can't pickle instancemethod objects I tried using joblib in sklearn.externals,...
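Two points usually resolve this: the file must be opened in binary mode for pickle, and the "can't pickle instancemethod" error typically means a bound method was passed as the vectorizer's tokenizer/analyzer, which plain pickle cannot serialize. A hedged sketch, assuming a vectorizer with only picklable attributes:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(["save this vectorizer", "reuse it later"])

# Binary mode ('wb'/'rb') is required for pickle files.
with open("vectorizer.obj", "wb") as loc:
    pickle.dump(vectorizer, loc)

with open("vectorizer.obj", "rb") as loc:
    restored = pickle.load(loc)

# The restored object carries the same fitted vocabulary.
print(restored.vocabulary_ == vectorizer.vocabulary_)  # True
```

If a custom tokenizer is needed, define it as a module-level function (not a method) so pickle can find it by name.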

What exactly is the difference between AnalysisEngine and CAS Consumer?

I'm learning UIMA, and I can create basic analysis engines and get results. But what I'm finding difficult to understand is the use of CAS Consumers. At the same time, I want to know how different it is from an AnalysisEngine. From many examples I have seen, a CAS Consumer is not...

Using Python to find correlation pairs

NAME      PRICE   SALES  VIEWS  AVG_RATING  VOTES  COMMENTS
Module 1  $12.00  69     12048  5           3      26
Module 2  $24.99  12     52858  5           1      14
Module 3  $10.00  1      1381   -1          0      0
Module 4  $22.99  46     57841  5           8      24
.................

So, let's say I have statistics of sales. I...
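A hedged sketch of finding the strongest correlation pairs with pandas (toy numbers below, not the asker's table):

```python
import pandas as pd

df = pd.DataFrame({
    "SALES": [69, 12, 1, 46],
    "VIEWS": [12048, 52858, 1381, 57841],
    "VOTES": [3, 1, 0, 8],
})

# Pairwise Pearson correlations, flattened into (col_a, col_b) -> r.
corr = df.corr().unstack()

# Drop the trivial self-correlations, then rank pairs by |r|.
pairs = corr[corr.index.get_level_values(0) != corr.index.get_level_values(1)]
print(pairs.abs().sort_values(ascending=False).head())
```

Note each pair appears twice ((a, b) and (b, a)); deduplicate if that matters.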

How to configure Stanford QNMinimizer to get results similar to scipy.optimize.minimize L-BFGS-B

I want to configure the QNMinimizer from the Stanford CoreNLP lib to get nearly the same optimization results as scipy.optimize's L-BFGS-B implementation, or a standard L-BFGS configuration that is suitable for most cases. I set the standard parameters as follows: The Python example I want to copy: scipy.optimize.minimize(neuralNetworkCost,...

Extract Patterns from the device log data

I am working on a project in which we have to extract patterns (user behavior) from device log data. A device log contains different device actions with a timestamp, e.g. when a device was switched on or when it was switched off. For example: When a person enters a room....

How to select only complete rows in a pandas DataFrame

I have the following data-set in Python: import pandas as pd bcw = pd.read_csv('', header=None) Lines like 24 have missing values: 1057013,8,4,5,1,2,?,7,3,1,4 In column 7 there is a '?', and I want to drop this line. How can I achieve this? ...
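A hedged sketch, assuming the file is read with header=None so columns are just numbered (the second toy row below is made up for illustration):

```python
import io
import pandas as pd

# Stand-in for pd.read_csv(<file>, header=None); the second row is invented.
csv = io.StringIO("1057013,8,4,5,1,2,?,7,3,1,4\n1096800,6,6,6,9,6,7,7,8,1,4\n")
bcw = pd.read_csv(csv, header=None)

# Keep only rows in which no cell equals '?'.
clean = bcw[(bcw != "?").all(axis=1)]
print(len(clean))  # 1
```

An equivalent approach is `pd.read_csv(..., header=None, na_values='?')` followed by `.dropna()`, which also restores the numeric dtype of the affected column.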

What is the default behavior of Stanford NLP's WordsToSentencesAnnotator when splitting a text into sentences?

Looking at DEFAULT_BOUNDARY_REGEX = "\\.|[!?]+"; led me to think that the text would get split into sentences based on ., ! and ?. However, if I pass the string D R E L I N. Okay. as input, e.g. using the command line interface: java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP...

Term frequency over time: how to plot 200+ graphs in one plot with Python/pandas/matplotlib?

I am conducting a textual content analysis of several web blogs, now focusing on finding emerging trends. To do so for one blog, I coded a multi-step process: looping over all the posts, finding the top 5 keywords in each post, adding them to a list, if...
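A hedged sketch of the plotting step, assuming the per-term frequencies have been collected into one DataFrame with a column per term: `DataFrame.plot` draws one line per column on a single Axes, and suppressing the legend keeps 200 series readable.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in: one column per term, one row per time step.
rng = np.random.default_rng(0)
freqs = pd.DataFrame(rng.random((50, 200)),
                     columns=[f"term_{i}" for i in range(200)])

ax = freqs.plot(legend=False, alpha=0.3, figsize=(10, 6))
ax.set_xlabel("time")
ax.set_ylabel("term frequency")
plt.savefig("trends.png")
print(len(ax.lines))  # 200
```

With a low alpha, the few genuinely trending terms stand out against the faint mass of flat lines.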

Normalize a feature in this table

This has become quite a frustrating question, but I've asked in the Coursera discussions and they won't help. Below is the question: I've gotten it wrong 6 times now. How do I normalize the feature? Hints are all I'm asking for. I'm assuming x_2^(2) is the value 5184, unless I...

Abbreviation Reference for NLTK Parts of Speech

I'm using nltk to find the part of speech for each word in a sentence. It returns abbreviations that I can't fully intuit and can't find good documentation for. Running: import nltk sample = "There is no spoon." tokenized_words = nltk.word_tokenize(sample) tagged_words = nltk.pos_tag(tokenized_words) print tagged_words Returns: [('There', 'EX'),...
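These abbreviations are Penn Treebank part-of-speech tags; `nltk.help.upenn_tagset('EX')` prints a tag's definition (after downloading NLTK's 'tagsets' data). A few common tags as a plain-Python reference (a hand-picked subset, not the full tagset):

```python
# A hand-picked subset of Penn Treebank POS tags and their meanings.
PENN_TAGS = {
    "DT":  "determiner",
    "EX":  "existential there",
    "IN":  "preposition or subordinating conjunction",
    "JJ":  "adjective",
    "NN":  "noun, singular or mass",
    "NNS": "noun, plural",
    "VBZ": "verb, 3rd person singular present",
}

def explain(tag):
    """Return a human-readable gloss for a Penn tag, if known."""
    return PENN_TAGS.get(tag, "unknown tag")

print(explain("EX"))  # existential there
```

So ('There', 'EX') means the tagger read "There" as an existential "there", which is correct for this sentence.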

POS of WSJ in CONLL format from penn tree bank

I've got the Penn Treebank CD. How do I convert the designated WSJ documents to CoNLL format? The original format is a tree structure. E.g., the CoNLL 2000 shared task uses the treebank. How was that format obtained? Thank you!

Are word-vector orientations universal?

I have recently been experimenting with Word2Vec and I noticed whilst trawling through forums that a lot of other people are also creating their own vectors from their own databases. This has made me curious as to how vectors look across databases and whether vectors take a universal orientation? I...

Opencv mlp Same Data Different Results

Let me simplify this question: if I run OpenCV MLP train and classify consecutively on the same data, I get different results. Meaning, if I put training a new MLP on the same training data and classifying on the same test data in a for loop, each iteration will give...

How to extract derivation rules from a bracketed parse tree?

I have a lot of parse trees like this: ( S ( NP-SBJ ( PRP I ) ) ( [email protected] ( VP ( VBP have ) ( NP ( DT a ) ( [email protected] ( NN savings ) ( NN account ) ) ) ) ( . . ) )...

How do I get the raw predictions (-r) from Vowpal Wabbit when running in daemon mode?

Using the below, I'm able to get both the raw predictions and the final predictions as a file: cat train.vw.txt | vw -c -k --passes 30 --ngram 5 -b 28 --l1 0.00000001 --l2 0.0000001 --loss_function=logistic -f model.vw --compressed --oaa 3 cat test.vw.txt | vw -t -i model.vw --link=logistic -r raw.txt...

Using the predict_proba() function of RandomForestClassifier in the safe and right way

I'm using scikit-learn to apply machine learning algorithms to my datasets. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. Instead of having Spam/Not Spam as the labels of emails, I wish to have, for example: 0.78 probability that a given email is Spam. For such...
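A hedged sketch on toy data: `predict_proba` returns one row per sample and one column per class, with column order following `clf.classes_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 0 = not spam, 1 = spam.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)
y = np.array([0, 0, 1, 1] * 10)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:2])
# proba[i, j] is the estimated P(class = clf.classes_[j] | X[i]);
# each row sums to 1.
print(clf.classes_)        # [0 1]
print(proba.sum(axis=1))   # [1. 1.]
```

For a forest, these probabilities are vote fractions over the trees, so treat them as scores rather than calibrated probabilities unless you calibrate explicitly.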

Nominal valued dataset in machine learning

What's the best way to use nominal values, as opposed to real or boolean ones, in a feature vector for machine learning? Should I map each nominal value to a real value? For example, if I want to make my program learn a predictive model...
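Mapping nominal values to arbitrary reals imposes a fake ordering, so the common alternative is one-hot encoding: one indicator column per value. A hedged pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# One 0/1 indicator column per nominal value, with no implied order.
encoded = pd.get_dummies(df, columns=["color"])
print(sorted(encoded.columns))
# ['color_blue', 'color_green', 'color_red']
```

scikit-learn's OneHotEncoder does the same inside a pipeline; tree-based models are the main family that can sometimes consume integer-coded nominals directly.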

How to specify the prior probability for scikit-learn's Naive Bayes

I'm using the scikit-learn machine learning library (Python) for a machine learning project. One of the algorithms I'm using is the Gaussian Naive Bayes implementation. One of the attributes of the GaussianNB class is the following: class_prior_ : array, shape (n_classes,) I want to alter the class prior manually since...
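Trailing-underscore attributes like class_prior_ are fitted values and aren't meant to be set by hand; recent scikit-learn versions instead expose a `priors` constructor parameter (a hedged sketch; availability depends on your sklearn version):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0], [1.1], [3.0], [3.1]])
y = np.array([0, 0, 1, 1])

# Fix the class priors instead of letting fit() estimate them
# from the class frequencies in y.
clf = GaussianNB(priors=[0.8, 0.2]).fit(X, y)
print(clf.class_prior_)  # [0.8 0.2]
```

The priors must match the number of classes and sum to 1, or fit() raises an error.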

Create Dictionary from Penn Treebank Corpus sample from NLTK?

I know that the Treebank corpus is already tagged, but unlike the Brown corpus, I can't figure out how to get a dictionary of tags. For instance, >>> from nltk.corpus import brown >>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words()) Why doesn't this work on the Treebank corpus?...

Why can't I calculate CostFunction J

This is my implementation of CostFunctionJ: function J = CostFunctionJ(X,y,theta) m = size(X,1); predictions = X*theta; sqrErrors =(predictions - y).^2; J = 1/(2*m)* sum(sqrErrors); But when I try to enter the command in MATLAB as: >> X = [1 1; 1 2; 1 3]; >> y = [1; 2; 3];...
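As a hedged cross-check of the formula J = 1/(2m) * sum((X*theta - y)^2), the same cost function translated to NumPy (same argument order as the MATLAB version):

```python
import numpy as np

def cost_function_j(X, y, theta):
    """Mean squared error cost: J = 1/(2m) * sum((X @ theta - y)^2)."""
    m = X.shape[0]
    predictions = X @ theta
    sq_errors = (predictions - y) ** 2
    return sq_errors.sum() / (2 * m)

X = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)
y = np.array([1.0, 2.0, 3.0])

print(cost_function_j(X, y, np.array([0.0, 1.0])))  # 0.0 (perfect fit)
print(cost_function_j(X, y, np.array([0.0, 0.0])))  # 14/6, about 2.333
```

With theta = [0; 1] the hypothesis reproduces y exactly, so J should be 0; if MATLAB reports something else, the error is in how the command was entered, not in the formula.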

Is it Item based or content based Collaborative filtering?

I am currently working on an existing system that recommends items that are similar to previous items that the user has liked. It uses Alternating least squares Collaborative Filtering to find feature vectors of users and items. It then uses the feature vectors of the items and uses the cosine...

How to cluster a set of strings?

My dataset looks something like this ['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '',...

Which classifiers provide weight vector?

What machine learning classifiers exist which provide a weight vector after the learning phase? I know about SVM, logistic regression, the perceptron and LDA. Are there more? My goal is to use these weight vectors to draw an importance map....
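In scikit-learn terms, any linear model exposes its learned weight vector as `coef_` after fitting; a hedged sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# One weight per feature (one row per class for multiclass problems).
print(clf.coef_.shape)  # (1, 2)
print(clf.intercept_)
```

The same attribute exists on LinearSVC, Perceptron, SGDClassifier, RidgeClassifier and LinearDiscriminantAnalysis, so any of them can feed an importance map; kernel SVMs and tree ensembles do not yield a single weight vector.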

What is rank in ALS machine Learning Algorithm in Apache Spark Mllib

I wanted to try an example of the ALS machine learning algorithm, and my code works fine. However, I do not understand the rank parameter used in the algorithm. I have the following code in Java // Build the recommendation model using ALS int rank = 10; int numIterations = 10; MatrixFactorizationModel model =...

Chinese sentence segmenter with Stanford CoreNLP

I'm using the Stanford CoreNLP system with the following command: java -cp stanford-corenlp-3.5.2.jar:stanford-chinese-corenlp-2015-04-20-models.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props -annotators segment,ssplit -file input.txt And this is working great on small Chinese texts. However, I need to train an MT system which just requires me to segment my input. So I just need...

Does Andrew Ng's ANN from Coursera use SGD or batch learning?

What type of learning is Andrew Ng using in his neural network exercise on Coursera? Is it stochastic gradient descent or batch learning? I'm a little confused right now......

How to interpret scikit-learn's confusion matrix and classification report?

I have a sentiment analysis task, and for this I'm using this corpus. The opinions have 5 classes (very neg, neg, neu, pos, very pos), from 1 to 5. So I do the classification as follows: from sklearn.feature_extraction.text import TfidfVectorizer import numpy as np tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2,2)) from sklearn.cross_validation...
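A hedged toy example of the two diagnostics in question (made-up labels for 3 of the 5 sentiment classes): in the confusion matrix, rows are true classes and columns are predicted classes, so off-diagonal cells count specific kinds of mistakes.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy true vs. predicted labels (classes 1, 2, 3).
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

# cell [i, j]: how often true class i was predicted as class j.
print(confusion_matrix(y_true, y_pred))
# Per-class precision, recall and F1, plus averages.
print(classification_report(y_true, y_pred))
```

Here, for example, one true-1 sample was predicted as 2, so class 1's recall is 0.5 even though its precision also suffers from the stray true-3 prediction.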

Python NLTK - Making a 'Dictionary' from a Corpus and Saving the Number Tags

I'm not super experienced with Python, but I want to do some Data analytics with a corpus, so I'm doing that part in NLTK Python. I want to go through the entire corpus and make a dictionary containing every word that appears in the corpus dataset. I want to be...

Why is there only one hidden layer in a neural network?

I recently made my first neural network simulation which also uses a genetic evolution algorithm. It's simple software that just simulates simple organisms collecting food, and they evolve, as one would expect, from organisms with random and sporadic movements into organisms with controlled, food-seeking movements. Since this kind of organism...

Basic Machine Learning: Linear Regression and Gradient Descent

I'm taking Andrew Ng's ML class on Coursera and am a bit confused on gradient descent. The screenshot of the formula I'm confused by is here: In his second formula, why does he multiply by the value of the ith training example? I thought when you updated you were just...

Coreference resolution using Stanford CoreNLP

I am new to the Stanford CoreNLP toolkit and trying to use it for a project to resolve coreferences in news texts. In order to use the Stanford CoreNLP coreference system, we would usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition and parsing....

Python NLTK pos_tag not returning the correct part-of-speech tag

Having this: text = word_tokenize("The quick brown fox jumps over the lazy dog") And running: nltk.pos_tag(text) I get: [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')] This is incorrect. The tags for quick brown lazy in the sentence should be:...

How to find the algo type (regression, classification) in caret in R for all algos at once?

How do I find the model type for all models at once? I know how to access this info if I know the algo name, e.g.: library('caret') tail(names(getModelInfo())) [1] "widekernelpls" "WM" "wsrf" "xgbLinear" "xgbTree" [6] "xyf" getModelInfo()$xyf$type [1] "Classification" "Regression" How do I see the $type for all the algos...

Stanford Parser - Factored model and PCFG

What is the difference between the factored and PCFG models of the Stanford Parser, in terms of how they work theoretically and mathematically?

Amazon Machine Learning for sentiment analysis

How flexible or supportive is the Amazon Machine Learning platform for sentiment analysis and text analytics?

Enforcing that inputs sum to 1 and are contained in the unit interval in scikit-learn

I have three inputs: x=(A, B, C); and an output y. It needs to be the case that A+B+C=1 and 0<=A<=1, 0<=B<=1, 0<=C<=1. I want to find the x that maximizes y. My approach is to use a regression routine in scikit-learn to train a model f on my inputs...

Spectral clustering with Similarity matrix constructed by jaccard coefficient

I have a categorical dataset and I am performing spectral clustering on it, but I do not get very good output. I choose the eigenvectors corresponding to the largest eigenvalues as my centroids for k-means. Please find below the process I follow: 1. Create a symmetric similarity matrix (m*m) using...

Which spark MLIB algorithm to use?

I'm a newbie to machine learning and would like to understand what algorithm (a classification algorithm or a correlation algorithm?) to use in order to understand the relationship between one or more attributes. For example, consider the following set of attributes: Bill No, Bill Amount, Tip Amount, Waiter Name and...

Matlab: How can I store the output of “fitcecoc” in a database

In the Matlab help section, there's a very helpful example for solving classification problems under "Digit Classification Using HOG Features". You can easily execute the full script by clicking on 'Open this example'. However, I'm wondering if there's a way to store the output of "fitcecoc" in a database so you...

Why does classifier accuracy drop after PCA, even though 99% of the total variance is covered?

I have a 500x1000 feature matrix, and principal component analysis says that over 99% of the total variance is covered by the first component. So I replace the 1000-dimensional points by 1-dimensional points, giving a 500x1 feature vector (using Matlab's pca function). But my classifier accuracy, which was initially around 80% with...
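One common pitfall here: a direction can dominate the variance (e.g. because of feature scaling) without being discriminative, so keeping 99% of the variance can still discard the class signal. As a side note on the mechanics, a hedged scikit-learn sketch of keeping a variance fraction directly (in MATLAB the analogue is choosing components from pca's `explained` output):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 0] *= 100.0  # one unscaled feature dominates the variance

# n_components in (0, 1) keeps just enough components to reach
# that fraction of the total variance.
pca = PCA(n_components=0.99).fit(X)
print(pca.n_components_)                             # 1
print(pca.explained_variance_ratio_.sum() >= 0.99)   # True
```

Standardizing the features before PCA often changes which (and how many) components survive, and frequently restores classifier accuracy.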

How to programmatically create ensembles in Weka?

Does there already exist a class in weka that takes care of voting/averaging different models, or do I have to come up with my own scheme? I already looked for that kind of functionality on the web, but I couldn't find any specific information....