How to use/call stemmer (croatian stemmer) [closed]

import re import sys...

Is it possible to get a natural word after it has been stemmed?

I have the word "play", which after stemming has become "plai". Now I want to get "play" back. Is it possible? I used Porter's stemmer.
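Stemming is lossy, so a stem generally cannot be mapped back to a single original word. One common workaround is to record, at stemming time, which surface forms produced each stem. The following is a minimal sketch of that idea, not a fix for any particular library: `toy_stem` is a hypothetical stand-in that only imitates the final y-to-i rewrite seen in the question, and `build_reverse_index` is an illustrative name.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: only rewrites a final 'y' as 'i'."""
    if len(word) > 2 and word.endswith("y"):
        return word[:-1] + "i"
    return word


def build_reverse_index(words, stem=toy_stem):
    """Map each stem to the set of original words that produced it."""
    index = defaultdict(set)
    for word in words:
        index[stem(word)].add(word)
    return index


index = build_reverse_index(["play", "plai", "happy", "cat"])
# Several words can share one stem, so the lookup yields candidates, not one answer.
candidates = sorted(index["plai"])
```

In practice a real stemmer would be passed in as the `stem` argument, and the caller would still need some rule (for example, frequency) to choose among the candidates.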

StanfordNLP lemmatization cannot handle -ing words

I've been experimenting with the Stanford NLP toolkit and its lemmatization capabilities, and I am surprised by how it lemmatizes some words. For example: depressing -> depressing, depressed -> depressed, depresses -> depress. It is not able to transform depressing and depressed into the same lemma. Something similar happens with confusing and confused, hopelessly...

UnicodeDecodeError unexpected end of data while stemming over dataset

I am new to Python and I am trying to work on a small chunk of the Yelp! dataset, which was in JSON but which I converted to CSV, using the pandas library and NLTK. While preprocessing the data, I first try to remove all the punctuation and also the most common...
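The "unexpected end of data" message in the title is what Python's UTF-8 decoder reports when a multi-byte character is cut off at the end of the input. The sketch below reproduces that situation with a deliberately truncated byte string; whether the asker's CSV was truncated in this way is an assumption, since the excerpt does not show the traceback.

```python
# A UTF-8 "é" is two bytes (0xC3 0xA9); dropping the last byte cuts it in half.
truncated = "café".encode("utf-8")[:-1]

try:
    truncated.decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason  # short description of why decoding failed

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
recovered = truncated.decode("utf-8", errors="replace")
```

Replacing bad bytes hides the symptom; if the file was written in a different encoding, reading it with that encoding is usually the better fix.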

Why does PorterStemmer in NLTK convert my "string" into u"string"?

    import nltk
    import string
    from nltk.corpus import stopwords
    from collections import Counter

    def get_tokens():
        with open('comet_interest.xml','r') as bookmark:
            text=bookmark.read()
            lowers=text.lower()
            no_punctuation=lowers.translate(None,string.punctuation)
            tokens=nltk.word_tokenize(no_punctuation)
            return tokens

    #remove stopwords
    tokens=get_tokens()
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    count = Counter(filtered)
    print count.most_common(10)

    #stemming
    from nltk.stem.porter import *
    def...

Terms get truncated after indexing document (Elasticsearch)

I'm new to Elasticsearch, and all I did was index some documents. Then, on retrieving the term vectors, I noticed that quite a few terms are truncated. Here is a small example: "nationallypublic": { "term_freq": 1, "tokens": [ { "position": 496, "start_offset": 3126, "end_offset": 3146 } ]...

stemming words in python

I'm using this code to stem words. Here is how it works: first there is a list of suffixes; the program checks whether the word ends with one of the suffixes in the list and, if so, removes that suffix. However, when I run the code I get this result:...
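The approach described above (check a word against a list of suffixes and strip a match) can be sketched as follows. This is not the asker's code, which the excerpt does not show: the suffix list, the `min_stem` guard, and the function name are all assumptions chosen for illustration. Trying longer suffixes first and refusing to strip very short words avoids two common surprises with this kind of stemmer.

```python
SUFFIXES = ["ing", "ed", "ly", "es", "s"]  # hypothetical list, not the asker's


def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no suffix matched, or the remainder would be too short


stems = [strip_suffix(w) for w in ["walking", "walked", "boxes", "is", "cat"]]
```

A naive stripper like this still over- and under-stems many words, which is why rule-based stemmers such as Porter's add conditions on the remaining stem.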

How to split a text into two meaningful words in R

I had a text data frame containing sentences, and since I wanted the list of separate words in another data frame, I used the all_words function from the qdap package: Words = all_words(df$problem_note_text, begins.with=NULL , alphabetical = FALSE, apostrophe.remove = TRUE, char.keep = char2space, char2space = "~~") Now I have a data frame which has...

Logical flaw: if List is null return input else print function output

In my code I call this method as a preprocessing step to 'stem' words:

    public void getStem(String word) {
        WordnetStemmer stem = new WordnetStemmer( dict );
        List<String> stemmed_words = stem.findStems(word, POS.VERB);
        System.out.println( stemmed_words.get(0) );
    }

Usually everything is fine if it gets a normal word (I'm using the Java Wordnet...
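The logic named in the title (return the input when no stems are found, otherwise use the stemmer's output) can be sketched in Python as below. This is an analogue of the Java method, not a port of it: `find_stems` and `lookup` are hypothetical stand-ins for the stemmer call, and the excerpt is truncated, so the assumption that the stemmer may return null or an empty list for unknown words is not confirmed by the source.

```python
def first_stem_or_word(word, find_stems):
    """Return the first stem, or the input word when no stem is found."""
    stems = find_stems(word)
    if not stems:  # covers both None and an empty list
        return word
    return stems[0]


# Hypothetical lookup standing in for a dictionary-backed stemmer.
KNOWN_STEMS = {"running": ["run"], "went": ["go"]}


def lookup(word):
    return KNOWN_STEMS.get(word, [])


results = [first_stem_or_word(w, lookup) for w in ["running", "xyzzy"]]
```

Returning the value instead of printing it inside the helper also lets the caller decide what to do with unknown words.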

Greek words stemming Lucene

Is there any way to stem single Greek words with Lucene? Do I need to index the String, or is there a simpler way? I did some research and found this link, but I don't really know how to use the Greek stemming filter.