Stemming words in Python

I'm using this code to stem words. Here is how it works: there is a list of suffixes, and the program checks whether the word ends with one of them. If it does, the suffix is removed. However, when I run the code I get this result:...
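The suffix-stripping approach described in this question can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the suffix list, the `strip_suffix` name, and the `min_stem` guard are all assumptions made for the example.

```python
# Hypothetical suffix list; the list used in the question is not shown.
SUFFIXES = ["ing", "ed", "ly", "es", "s"]


def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    # Try longer suffixes first so that "es" wins over "s" for a word like "boxes".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    # No suffix matched (or the remaining stem would be too short).
    return word


stems = [strip_suffix(w) for w in ["jumping", "walked", "boxes", "sing"]]
```

Checking the longest suffix first and refusing to shrink a word below a minimum length are two common ways to avoid surprising results such as "sing" being reduced to "s".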

Greek words stemming Lucene

Is there any way to stem single Greek words with Lucene? Do I need to index the String, or is there a simpler way? I did some research and found this link, but I don't really know how to use the Greek Stemming Filter.

How to split a text into two meaningful words in R

I had a text data frame containing sentences, and since I wanted the list of separate words in another data frame, I used the "all_words" function from the "qdap" package:

    Words = all_words(df$problem_note_text, begins.with=NULL , alphabetical = FALSE, apostrophe.remove = TRUE, char.keep = char2space, char2space = "~~")

Now I have a data frame which has...

Is it possible to get a natural word after it has been stemmed?

I have the word "play", which became "plai" after stemming. Now I want to get "play" back. Is that possible? I used Porter's stemmer.
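Stemming is lossy, since several words can collapse to the same stem, so a stem cannot in general be inverted on its own. One common workaround is to record which original words produced each stem at the time of stemming. The sketch below assumes that approach; `build_stem_index` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in that mimics the "y" to "i" rewrite, not Porter's algorithm.

```python
from collections import defaultdict


def toy_stem(word):
    """Stand-in stemmer for illustration only (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Mimic the rewrite that turns "play" into "plai".
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word


def build_stem_index(words, stem):
    """Map each stem to the set of original words that produced it."""
    index = defaultdict(set)
    for word in words:
        index[stem(word)].add(word)
    return index


index = build_stem_index(["play", "plays", "playing", "played"], toy_stem)
originals = index["plai"]  # every surface form seen for this stem
```

Any real stemmer can be passed in place of `toy_stem`. Choosing a single "natural" word from the set (for example the shortest or the most frequent) is a separate decision the index does not make.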

Logical flaw: if List is null return input else print function output

In my code I call this method as a preprocessing step to 'stem' words:

    public void getStem(String word) {
        WordnetStemmer stem = new WordnetStemmer( dict );
        List<String> stemmed_words = stem.findStems(word, POS.VERB);
        System.out.println( stemmed_words.get(0) );
    }

Usually everything is good if it gets a normal word (I'm using the Java Wordnet...
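The excerpt is cut off, so the exact failure is not shown. Assuming the problem is that the stem list can be missing or empty for unusual words, the fallback named in the title can be sketched like this. The sketch is in Python rather than Java, and `find_stems` is a hypothetical callable standing in for the stemmer lookup.

```python
def first_stem_or_input(word, find_stems):
    """Return the first stem, or the input word when no stem is found."""
    stems = find_stems(word)
    if not stems:  # covers both None and an empty list
        return word
    return stems[0]


# Hypothetical lookups used only to exercise both branches.
known = first_stem_or_input("running", lambda w: ["run"])
unknown = first_stem_or_input("xyzzy", lambda w: [])
```

Returning the value instead of printing it inside the method also makes the result usable by later preprocessing steps.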

How to use/call stemmer (croatian stemmer) [closed]

import re import sys...

Why does PorterStemmer in NLTK convert my “string” into u“string”

    import nltk
    import string
    from nltk.corpus import stopwords
    from collections import Counter

    def get_tokens():
        with open('comet_interest.xml','r') as bookmark:
            text=bookmark.read()
            lowers=text.lower()
            no_punctuation=lowers.translate(None,string.punctuation)
            tokens=nltk.word_tokenize(no_punctuation)
            return tokens

    #remove stopwords
    tokens=get_tokens()
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    count = Counter(filtered)
    print count.most_common(10)

    #stemming
    from nltk.stem.porter import *
    def...

UnicodeDecodeError unexpected end of data while stemming over dataset

I am new to Python and I am trying to work on a small chunk of the Yelp! dataset, which was in JSON but which I converted to CSV, using the pandas library and NLTK. While preprocessing the data, I first try to remove all the punctuation and also the most common...
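The "unexpected end of data" reason in the title is what Python's UTF-8 decoder reports when a byte string stops in the middle of a multi-byte character. Whether that is what happened to this dataset is not shown in the excerpt; the sketch below (written for Python 3, with an invented sample string) only reproduces the error and shows one lenient way to decode.

```python
# "é" is two bytes in UTF-8; dropping the last byte leaves an incomplete character.
truncated = "café".encode("utf-8")[:-1]

reason = None
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason  # explanation supplied by the codec

# errors="replace" substitutes U+FFFD for the bytes that cannot be decoded.
lenient = truncated.decode("utf-8", errors="replace")
```

Replacing undecodable bytes hides the damage rather than repairing it, so it is usually worth checking how the text was sliced or converted before it reached the stemmer.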

Terms get truncated after indexing document (Elasticsearch)

I'm new to Elasticsearch, and all I did was index some documents. Then, on retrieving the term vectors, I noticed that quite a few terms are truncated. Here is a small example:

    "nationallypublic": {
      "term_freq": 1,
      "tokens": [
        { "position": 496, "start_offset": 3126, "end_offset": 3146 }
      ]...

StanfordNLP lemmatization cannot handle -ing words

I've been experimenting with the Stanford NLP toolkit and its lemmatization capabilities. I am surprised by how it lemmatizes some words. For example:

    depressing -> depressing
    depressed -> depressed
    depresses -> depress

It is not able to transform depressing and depressed into the same lemma. Something similar happens with confusing and confused, hopelessly...