Greek words stemming Lucene

lucene,stemming
Is there any way to stem single Greek words with Lucene? Do I need to index the String, or is there a simpler way? I did some research and found this link, but I don't really know how to use the Greek Stemming Filter.

Why does PorterStemmer in NLTK convert my “string” into u“string”?

python,nltk,sax,stemming
import nltk
import string
from nltk.corpus import stopwords
from collections import Counter
def get_tokens():
    with open('comet_interest.xml','r') as bookmark:
        text=bookmark.read()
        lowers=text.lower()
        no_punctuation=lowers.translate(None,string.punctuation)
        tokens=nltk.word_tokenize(no_punctuation)
        return tokens
#remove stopwords
tokens=get_tokens()
filtered = [w for w in tokens if not w in stopwords.words('english')]
count = Counter(filtered)
print count.most_common(10)
#stemming
from nltk.stem.porter import *
def...
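The excerpt above is Python 2 code (a `print` statement and the two-argument form of `str.translate`). The `u` prefix in the title is how Python 2 displays a `unicode` object via `repr`, for example when it is printed inside a list; it is not part of the string's contents. As a rough sketch only, the punctuation-stripping and counting steps might look like this in Python 3, where the prefix no longer appears. To keep it standard-library only, `str.split` stands in for `nltk.word_tokenize` and the tiny `STOPWORDS` set stands in for NLTK's English stopword list; both are assumptions, as is passing the text in as an argument.

```python
import string
from collections import Counter

# Tiny illustrative stand-in for stopwords.words('english'); an assumption.
STOPWORDS = {"the", "a", "of", "and"}


def get_tokens(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    lowers = text.lower()
    # Python 3 replacement for the Python 2 call translate(None, string.punctuation).
    no_punctuation = lowers.translate(str.maketrans("", "", string.punctuation))
    return no_punctuation.split()


def most_common_words(text, n=10):
    """Count the non-stopword tokens and return the n most frequent."""
    filtered = [w for w in get_tokens(text) if w not in STOPWORDS]
    return Counter(filtered).most_common(n)
```

Using a set for the stopwords also avoids rebuilding the stopword list for every token, which the list comprehension in the excerpt does.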

Logical flaw: if List is null return input else print function output

java,null,wordnet,stemming,data-processing
In my code I call this method as a preprocessing step to 'stem' words:
public void getStem(String word) {
    WordnetStemmer stem = new WordnetStemmer( dict );
    List<String> stemmed_words = stem.findStems(word, POS.VERB);
    System.out.println( stemmed_words.get(0) );
}
Usually everything is good if it gets a normal word (I'm using the Java Wordnet...
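The guard described in the title (return the input when there is no stem, otherwise use the stemmer's output) can be sketched as follows. The sketch is in Python rather than the question's Java, and `find_stems` is a hypothetical stand-in for `WordnetStemmer.findStems`; it assumes the lookup may return `None` or an empty list for a word it does not recognise.

```python
from typing import Callable, List, Optional


def first_stem_or_word(word: str,
                       find_stems: Callable[[str], Optional[List[str]]]) -> str:
    """Return the first stem for `word`, or `word` itself when none is found."""
    stems = find_stems(word)
    # `not stems` is true for both None and an empty list, so the
    # indexing below never runs on a missing or empty result.
    if not stems:
        return word
    return stems[0]


# Hypothetical lookup table standing in for a dictionary-backed stemmer.
_TOY_STEMS = {"running": ["run"], "went": ["go"]}


def toy_find_stems(word: str) -> Optional[List[str]]:
    """Return the known stems for `word`, or None for an unknown word."""
    return _TOY_STEMS.get(word)
```

For example, `first_stem_or_word("running", toy_find_stems)` gives `"run"`, while an unknown word is handed back unchanged instead of raising an index error.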

StanfordNLP lemmatization cannot handle -ing words

java,nlp,stanford-nlp,stemming,lemmatization
I've been experimenting with the Stanford NLP toolkit and its lemmatization capabilities. I am surprised by how it lemmatizes some words. For example:
depressing -> depressing
depressed -> depressed
depresses -> depress
It is not able to transform depressing and depressed into the same lemma. Something similar happens with confusing and confused, hopelessly...

Stemming words in Python

python,nlp,stemming
I'm using this code to stem words. Here is how it works: there is a list of suffixes, and the program checks whether the word ends with one of the suffixes in the list; if it does, the suffix is removed. However, when I run the code I get this result:...
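The asker's code is not shown in the excerpt, so the following is only a minimal sketch of the approach described: check a word against a list of suffixes and remove the first one that matches. The suffix list, the longest-first ordering, and the minimum stem length are all assumptions made for illustration.

```python
# Illustrative suffix list; not the asker's actual list.
SUFFIXES = ["ness", "ing", "ed", "ly", "es", "s"]
# Assumed guard so that very short words such as "is" are left untouched.
MIN_STEM_LENGTH = 3


def strip_suffix(word, suffixes=SUFFIXES, min_stem_length=MIN_STEM_LENGTH):
    """Remove the longest matching suffix, or return the word unchanged."""
    # Try longer suffixes first so "ness" is matched before "s".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word
```

Trying longer suffixes first and refusing to shorten a word below a minimum length are two common ways to avoid surprising results from this kind of stripping, such as "red" losing its "ed".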

Is it possible to get a natural word after it has been stemmed?

nlp,stemming,porter-stemmer
I have the word "play", which after stemming has become "plai". Now I want to get "play" back again. Is it possible? I have used Porter's stemmer.
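Stemming is lossy: several different words can share one stem, so the stem alone does not determine the original word. One possible workaround, sketched here under assumptions, is to record while stemming which surface forms produced each stem and look the stem up afterwards. `toy_stem` is a hypothetical stand-in for a real stemmer and only mimics the "play" to "plai" behaviour mentioned in the question.

```python
from collections import Counter, defaultdict


def build_stem_index(words, stem):
    """Map each stem to a Counter of the original words that produced it."""
    index = defaultdict(Counter)
    for word in words:
        index[stem(word)][word] += 1
    return index


def most_common_original(index, stemmed):
    """Return the most frequent original word for a stem, or None if unseen."""
    counts = index.get(stemmed)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def toy_stem(word):
    """Hypothetical stemmer: drop a final "s", then turn a final "y" into "i"."""
    if word.endswith("s"):
        word = word[:-1]
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word
```

With the words `["play", "plays", "play"]`, looking up `"plai"` returns `"play"` because it is the most frequent form seen for that stem. The lookup can only recover words that were actually seen during indexing.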

How to use/call a stemmer (Croatian stemmer) [closed]

python,call,stemming
import re import sys...

UnicodeDecodeError unexpected end of data while stemming over dataset

python,unicode,pandas,nltk,stemming
I am new to Python and I am trying to work on a small chunk of the Yelp dataset, which was in JSON but which I converted to CSV, using the pandas library and NLTK. While preprocessing the data, I first try to remove all the punctuation and also the most common...
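For context, "unexpected end of data" is the message Python's UTF-8 decoder gives when a byte string stops partway through a multi-byte character. Whether that is what happens in the asker's dataset cannot be told from the excerpt; the sketch below only reproduces the message on a made-up byte string and shows the `errors` argument of `bytes.decode`.

```python
# "é" takes two bytes in UTF-8 (0xC3 0xA9); dropping the last byte
# leaves an incomplete sequence at the end of the data.
complete = "café".encode("utf-8")
truncated = complete[:-1]

message = None
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    # The decoder reports why it failed in the exception's `reason` attribute.
    message = exc.reason

# errors="replace" substitutes U+FFFD for the undecodable bytes;
# errors="ignore" silently drops them.
lenient = truncated.decode("utf-8", errors="replace")
dropped = truncated.decode("utf-8", errors="ignore")
```

Both lenient modes hide data loss, so they are better suited to diagnosing where the bad bytes are than to serving as a permanent fix.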

How to split a text into two meaningful words in R

r,string-split,stemming,text-analysis
I had a text data frame containing sentences, and because I wanted the list of separate words in another data frame, I used the all_words function from the qdap package:
Words = all_words(df$problem_note_text, begins.with=NULL , alphabetical = FALSE, apostrophe.remove = TRUE, char.keep = char2space, char2space = "~~")
Now I have a data frame which has...

Terms get truncated after indexing document (Elasticsearch)

elasticsearch,stemming
I'm new to Elasticsearch, and all I did was index some documents. Then, on retrieving the term vectors, I noticed that quite a few terms are truncated. Here is a small example:
"nationallypublic": {
  "term_freq": 1,
  "tokens": [
    { "position": 496, "start_offset": 3126, "end_offset": 3146 }
  ]...