

In R: How do I iterate through character strings in a loop?

r,for-loop,tm
I'm trying to access character strings from a vector in a for-loop. I have a Corpus like this one: library(tm) corpus = Corpus(VectorSource(c("cfilm,cgame,ccd","cd,film,cfilm"))) My goal is to get rid of all unnecessary "c" characters. Note that this means I don't want to remove the c from cd, but from ccd, cgame...
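A minimal sketch of one possible approach, assuming the set of valid stems is known in advance (the `stems` whitelist below is a hypothetical list, not something from the question): strip a leading "c" only when the remainder is itself a real word.

```r
library(tm)

corpus <- Corpus(VectorSource(c("cfilm,cgame,ccd", "cd,film,cfilm")))

# Hypothetical whitelist of valid stems; "cd" alone is left untouched
# because "d" is not in the list.
stems   <- c("film", "game", "cd")
pattern <- paste0("\\bc(", paste(stems, collapse = "|"), ")\\b")

# Remove the spurious leading "c" inside every document.
corpus <- tm_map(corpus, content_transformer(function(x) gsub(pattern, "\\1", x)))

as.character(corpus[[1]])  # "film,game,cd"
```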

The function tm::tm_map encounters an error

r,tm,mclapply
I have a VCorpus "oanc" and I want to change all the words to lower case, so I use the following function oanc1 <- tm_map(oanc, content_transformer(tolower)) But I got a warning: Warning message: In mclapply(content(x), FUN, ...) : scheduled cores 2 encountered errors in user code, all values of the...
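This warning comes from the parallel `mclapply()` call that tm uses internally. A commonly suggested workaround (an assumption about the environment, not an official fix) is to disable forked parallelism before calling `tm_map`:

```r
library(tm)

# parallel::mclapply reads the "mc.cores" option; forcing a single
# core avoids the "scheduled cores encountered errors" warning on
# some multi-core setups.
options(mc.cores = 1)

oanc1 <- tm_map(oanc, content_transformer(tolower))
```

If the warning persists, inspecting `oanc1[[1]]` usually reveals whether the transformation actually failed or merely warned.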

What is the getText function in text mining? Where does it come from? [R]

r,twitter,text-mining,tm
I am following a text-mining example from Social Media Mining with R by Nathan Danneman & Richard Heimann: The Book. After pulling tweets using the searchTwitter function, the author uses sapply on the list to extract the text portion like this: rstats <- searchTwitter("#rstats", n = 1000) rstats_list <-...
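`getText()` is not a base R or tm function; it is a method of the status objects that the twitteR package returns. A short sketch of how it is typically invoked:

```r
library(twitteR)

# searchTwitter() returns a list of status reference-class objects;
# each one carries its own getText() method.
rstats      <- searchTwitter("#rstats", n = 1000)
rstats_text <- sapply(rstats, function(tweet) tweet$getText())
```

This is why `getText` does not show up via `?getText` unless twitteR is loaded.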

How to reconnect to the PCorpus in the R tm package?

r,tm,corpus
I create a PCorpus, which as far as I understand is stored on HDD, with the following code: pc = PCorpus(vs, readerControl = list(language = "pl"), dbControl = list(dbName = "pcorpus", dbType = "DB1")) How may I reconnect to that database later?...
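The PCorpus content is stored in a filehash database on disk. One way to look at that database later (a sketch based on the filehash package, not an official tm reconnection API) is to open it directly; calling `PCorpus()` again with the same `dbControl` is also commonly reported to reuse the existing database.

```r
library(filehash)

# Open the on-disk database that PCorpus created.
db <- dbInit("pcorpus", type = "DB1")

# List the keys of the stored documents.
dbList(db)
```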

Error in using “termFreq” function in R [closed]

r,frequency,tm
I built a corpus in R using the tm package. I want to change the frequency boundaries and only keep the words that are repeated at least 4 times in the entire document. After that, I need to build a document-term matrix based on these terms. 'Data' is a 45k...
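A sketch of one way to get this, assuming `corpus` is the tm corpus already built: construct the document-term matrix first, then keep only the terms whose total frequency is at least 4.

```r
library(tm)

dtm <- DocumentTermMatrix(corpus)

# findFreqTerms() returns the terms occurring at least `lowfreq`
# times across the whole corpus; subset the DTM to those columns.
keep <- findFreqTerms(dtm, lowfreq = 4)
dtm4 <- dtm[, keep]
```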

How to show corpus text in R tm package?

r,tm,corpus
I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I show the text of a plain-text corpus in the R tm package? I've loaded 323 plain text files into a corpus: src <- DirSource("Korpora/technologie") corpus <- Corpus(src) But when...
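In current tm versions `inspect()` only prints a summary of each document, not its text. A short sketch for actually displaying the content:

```r
library(tm)

src    <- DirSource("Korpora/technologie")
corpus <- Corpus(src)

# Show the text of the first document.
writeLines(as.character(corpus[[1]]))

# Or extract the text of every document as a list of character vectors.
texts <- lapply(corpus, as.character)
```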

R text mining: grouping similar patterns from a data frame

r,dataframes,text-mining,names,tm
I have applied various cleaning functions from the tm package, like removing punctuation, numbers, special chars, common English words etc., and got a data frame as shown below. Remember, I don't have a primary key like cust_id or account_number to rely on. sno names 001 SIRIS BLACK 002 JOHN DOE 003 STEPHEN...

Document-Term-Matrix of tm Package in R

r,matrix,document,tm,term
I am using the document-term matrix from the tm package in R. I get an error: Doc <- DocumentTermMatrix(Data) Error in UseMethod("TermDocumentMatrix", x) : no applicable method for 'TermDocumentMatrix' applied to an object of class "table" I tried a data frame, data table, matrix and table, but I got the same error...
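`DocumentTermMatrix()` expects a tm corpus, not a table, matrix, or data frame, which is what the "no applicable method" error is saying. A sketch, assuming `Data` is (or can be coerced to) a character vector of documents:

```r
library(tm)

# Wrap the raw text in a corpus before building the matrix.
corpus <- Corpus(VectorSource(as.character(Data)))
Doc    <- DocumentTermMatrix(corpus)
```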

How to visualize the findAssocs() result from tm

r,data-visualization,text-mining,tm
I've extracted some tweets and put them into a term document matrix. Next I started looking for word associations - words which most frequently occur together. tweets_tdm <- TermDocumentMatrix(tweets_corpus) findAssocs(tweets_tdm, 'stackoverflow', 0.20) I get results which look like: programming 0.33 java 0.27 moderator 0.27 How can I visualize these results...
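One simple visualization sketch: `findAssocs()` returns a named list with one element per input term, and the named score vector inside plots directly as a bar chart.

```r
library(tm)

assocs <- findAssocs(tweets_tdm, "stackoverflow", 0.20)

# Pull out the named numeric vector of correlations and plot it.
scores <- assocs[["stackoverflow"]]
barplot(scores, las = 2, ylab = "correlation",
        main = "Terms associated with 'stackoverflow'")
```

For fancier output, the same vector feeds naturally into ggplot2 or a word cloud, but a base barplot already makes the ranking readable.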

agrep string matching in R

r,string-matching,tm,agrep,qdap
I have two lists of product names. My problem is that "Operating system" matches "system", "cooling system", etc., but it should match only "Operating" or "OS". Another example: "Key Board" should match "key" or "KB" but not "Mother Board" or just "Board". How to give...
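A sketch of one way to tighten this up: require whole-word matches instead of relying on fuzzy `agrep()` alone, and maintain an explicit synonym table for abbreviations (the "OS"/"KB" entries would be hand-curated; they are assumptions here, not derivable by string distance).

```r
products <- c("Operating system", "cooling system", "Key Board", "Mother Board")

# Whole-word match: \b anchors prevent "Board" matching inside "Key Board"
# differently from "Mother Board" only when combined with a synonym table.
matches_word <- function(query, x) {
  grepl(paste0("\\b", query, "\\b"), x, ignore.case = TRUE)
}

matches_word("Operating", products)  # TRUE FALSE FALSE FALSE
```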

Type/Token Ratio in R

r,if-statement,tm,corpus,linguistics
I'm working with a new corpus and want to get the type/token ratio. Does anyone know of a standard way to do this? I've been trawling around the internet and didn't find anything relevant. Even the tm package doesn't seem to have an easy way to do this. Just as...
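There is indeed no built-in TTR helper in tm, but the ratio is short to compute by hand. A sketch, assuming `corpus` is the tm corpus and whitespace tokenization is acceptable:

```r
library(tm)

# Flatten the corpus to lower-cased text and split into tokens.
text   <- tolower(unlist(lapply(corpus, as.character)))
tokens <- unlist(strsplit(text, "\\s+"))
tokens <- tokens[tokens != ""]

# Type/token ratio: distinct word forms over total word count.
ttr <- length(unique(tokens)) / length(tokens)
```

A more careful tokenizer (stripping punctuation first) will shift the number slightly; the structure stays the same.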

R - Tokenization - single and two letter words in a TermDocumentMatrix

r,nlp,tokenize,tm
I am currently trying to do a little bit of text processing and I would like to get the one and two letter words in a TermDocumentMatrix. The issue is that it seems to display only 3 letter words and more. library(tm) library(RWeka) test<-'This is a test.' testmyCorpus<-Corpus(VectorSource(test)) testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))...
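The cause is tm's default term filter, which drops words shorter than 3 characters (`wordLengths = c(3, Inf)`). Widening that bound keeps the one- and two-letter words:

```r
library(tm)

test <- "This is a test."
corp <- Corpus(VectorSource(test))

# Override the default minimum word length of 3.
tdm <- TermDocumentMatrix(corp, control = list(wordLengths = c(1, Inf)))
Terms(tdm)
```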

big document term matrix - error when counting the number of characters of documents

r,matrix,text-mining,tm
I have built a big document-term matrix with the package RTextTools. Now I am trying to count the number of characters in the matrix rows so that I can remove empty documents before performing topic modeling. My code gives no errors when I apply it to a sample of my...
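For a large matrix the usual failure mode is converting the sparse DTM to a dense one just to count characters. A sketch that stays sparse, assuming `dtm` is the document-term matrix: count tokens per document with `slam::row_sums()` and drop empty rows before topic modeling.

```r
library(slam)

# Token count per document, computed directly on the sparse matrix.
doc_lengths <- row_sums(dtm)

# Keep only non-empty documents.
dtm <- dtm[doc_lengths > 0, ]
```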

Topic modeling in R: Building topics based on a predefined list of terms

r,tm,topic-modeling
I’ve spent a couple of days working on topic models in R and I’m wondering if I could do the following: I would like R to build topics based on a predefined termlist with specific terms. I already worked with this list to identify ngrams (RWeka) in documents and count...
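This can be sketched with tm's `dictionary` control, which restricts the document-term matrix to a predefined term list before fitting the model (`termlist` and `k = 5` below are assumptions standing in for the real list and topic count):

```r
library(tm)
library(topicmodels)

# Build a DTM containing only the terms in the predefined list.
dtm <- DocumentTermMatrix(corpus, control = list(dictionary = termlist))

# Drop documents left empty by the restriction, then fit LDA.
dtm <- dtm[slam::row_sums(dtm) > 0, ]
fit <- LDA(dtm, k = 5)
```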

R: TM package Finding Word Frequency from a Single Column

r,tm,qdap
I've recently been working on trying to find the word frequency within a single column in a data.frame in R using the tm package. While the data.frame itself has many columns that are both numeric and character based, I'm only interested in a single column that is pure text. While...
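A sketch of the usual route: feed only the text column into the corpus, so the numeric columns never enter the pipeline (`df$comments` is a hypothetical column name):

```r
library(tm)

# Build the corpus from the single text column.
corpus <- Corpus(VectorSource(df$comments))
tdm    <- TermDocumentMatrix(corpus)

# Total frequency per term, sorted descending, without densifying.
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freq)
```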

How to speed up R code

r,loops,tm
The code below works fine to remove the stopwords in myCharVector, but when myCharVector contains a large number of sentences it takes too long to complete. How can I speed up the loop operation (using apply)? Thanks. library(tm) myCharVector <- c("This is the first sentence", "hello this is second", "and...
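No explicit loop should be needed here: tm's `removeWords()` is vectorized over character vectors, so one call processes every sentence at once.

```r
library(tm)

myCharVector <- c("This is the first sentence", "hello this is second")

# One vectorized call instead of a per-element loop.
cleaned <- removeWords(tolower(myCharVector), stopwords("english"))
```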

LDA with tm package in R using bigrams

r,text-mining,tm,tf-idf,lda
I have a csv with every row as a document. I need to perform LDA upon this. I have the following code : library(tm) library(SnowballC) library(topicmodels) library(RWeka) X = read.csv('doc.csv',sep=",",quote="\"",stringsAsFactors=FALSE) corpus <- Corpus(VectorSource(X)) corpus <- tm_map(tm_map(tm_map(corpus, stripWhitespace), tolower), stemDocument) corpus <- tm_map(corpus, PlainTextDocument) BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2,...
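A sketch of the bigram-to-LDA handoff: pass the RWeka tokenizer through the DTM control, drop documents that end up empty, and fit the model (`k = 10` is an assumption).

```r
library(tm)
library(RWeka)
library(topicmodels)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Build the DTM over bigrams rather than single words.
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))

# LDA cannot handle all-zero rows, so remove empty documents first.
dtm <- dtm[slam::row_sums(dtm) > 0, ]
fit <- LDA(dtm, k = 10)
```

One caveat worth checking in the original code: `VectorSource(X)` on a whole data frame treats each column, not each row, as a document; passing the text column alone avoids that.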

Splitting a document from a tm Corpus into multiple documents

regex,r,split,tm,text-analysis
A bit of a bizarre question: is there a way to split corpus documents that have been imported using the Corpus function in tm into multiple documents that can then be reread into my Corpus as separate documents? For example, if I used inspect(documents[1]) and had something like `<<VCorpus (documents:...
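A sketch of one way to do this: pull the text out of the existing document, split it on a delimiter of your choosing (the blank-line separator `"\n\n"` below is an assumption), and rebuild a corpus in which each piece is its own document.

```r
library(tm)

# Extract the raw text of the first document.
text <- as.character(documents[[1]])

# Split on blank lines; each piece becomes a separate document.
pieces <- unlist(strsplit(paste(text, collapse = "\n"), "\n\n", fixed = TRUE))

split_corpus <- VCorpus(VectorSource(pieces))
```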

Can't plot Zipf's law in R

r,distribution,tm,qdap
I have a big list of terms and their frequency loaded from a text file and I converted it to a table: myTbl = read.table("word_count.txt") # read text file colnames(myTbl)<-c("term", "frequency") head(myTbl, n = 10) > head(myTbl, n = 10) term frequency 1 de 35945 2 i 34850 3 \xe3n...
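A minimal plotting sketch for this table: sort by frequency and plot log(rank) against log(frequency), which should come out roughly linear if the data follow Zipf's law.

```r
myTbl <- read.table("word_count.txt")
colnames(myTbl) <- c("term", "frequency")

# Rank terms by frequency and plot on log-log axes.
freq <- sort(myTbl$frequency, decreasing = TRUE)
plot(log(seq_along(freq)), log(freq),
     xlab = "log(rank)", ylab = "log(frequency)",
     main = "Zipf's law")
```

If the data were in a document-term matrix instead of a flat table, tm also ships a `Zipf_plot()` helper.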

R: tm: TextDocument-level metadata setting appears to be very inefficient

r,metadata,text-mining,tm,corpus
I'm loading text documents from the database, then I create corpus from them, and finally I set prefixed id of the document (I need to use prefix, since I've got documents of several types). rs <- dbSendQuery(con,"SELECT id::TEXT, content FROM entry") entry.d = data.table(fetch(rs,n=-1)) entry.vs = VectorSource(entry.d$content) entry.vc = VCorpus(entry.vs,...

Text mining and NLP: from R to Python

python,r,nltk,text-mining,tm
First of all, I should say that I am new to Python. At the moment I am "translating" a lot of R code into Python and learning along the way. This question relates to this one, replicating R in Python (in there they actually suggest wrapping it up using rpy2, which...

My DocumentTermMatrix reduces to zero columns

r,text-mining,tm,term-document-matrix
train <- read.delim('train.tsv', header= T, fileEncoding= "windows-1252",stringsAsFactors=F) Train.tsv contains 156,060 lines of text with 4 columns: Phrase, PhraseID, SentenceID and Sentiment (on a scale of 0 to 4). The Phrase column has the text lines. (tm package already loaded) R version: 3.1.2; OS: Windows 7, 64 bit, 4 GB RAM. > dput(head(train,6))...

Can't Inspect Text Corpus in R

r,text,text-mining,tm
I am trying to create a Corpus for further analysis; the code I am showing you suddenly stopped working and I cannot find a solution for this error. I execute this: library("tm") library("SnowballC") library("wordcloud") library("arules") library("arulesViz") #library("e1071") #LOAD_DATA###################################################################### setwd("D:/Dysk Google/Shared/SGGW/MGR_R2/Metody Eksploracji Danych/_PROJEKT") smSPAM <- read.table("smSPAM.txt", sep="\t", quote="", stringsAsFactors = F) dim(smSPAM) colnames(smSPAM)...

Extract and count common word-pairs from character vector

r,regex-lookarounds,tm,qdap
How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels". The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion,...
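A base-R sketch that avoids lookaheads entirely: lower-case and tokenize the text, pair each token with its neighbor, and tabulate the pairs (the `crude` corpus here is the example data set shipped with tm).

```r
library(tm)
data("crude")

# Flatten the corpus to lower-cased tokens.
text   <- tolower(unlist(lapply(crude, as.character)))
tokens <- unlist(strsplit(text, "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Pair each token with the one that follows it, then count pairs.
pairs <- paste(head(tokens, -1), tail(tokens, -1))
head(sort(table(pairs), decreasing = TRUE))
```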