
How to split a text into two meaningful words in R

I have a text data frame containing sentences, and since I wanted a list of the separate words in another data frame, I used the all_words function from the qdap package: Words = all_words(df$problem_note_text, begins.with = NULL, alphabetical = FALSE, apostrophe.remove = TRUE, char.keep = char2space, char2space = "~~"). Now I have a data frame which has...
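For comparison outside qdap, the same word split can be sketched in plain Python; the regex tokenizer here is an assumption standing in for all_words, and the sample notes are invented:

```python
import re

def all_words_py(sentences):
    """Split a list of sentences into lowercase words."""
    words = []
    for sentence in sentences:
        # Runs of letters/digits only; apostrophes are dropped,
        # mirroring apostrophe.remove = TRUE in the question
        words.extend(re.findall(r"[A-Za-z0-9]+", sentence.lower()))
    return words

notes = ["The server won't restart", "Disk is full"]
print(all_words_py(notes))
# ['the', 'server', 'won', 't', 'restart', 'disk', 'is', 'full']
```

The resulting list can then be placed in its own data frame (e.g. a pandas DataFrame) for further analysis.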

Splitting a document from a tm Corpus into multiple documents

A bit of a bizarre question: is there a way to split corpus documents that have been imported using the Corpus function in tm into multiple documents that can then be re-read into my corpus as separate documents? For example, if I used inspect(documents[1]) and had something like `<<VCorpus (documents:...
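tm aside, the underlying operation — breaking one document's text into several pieces and treating each piece as its own document — can be sketched in Python; the blank-line delimiter is an assumption, and on the R side the pieces would be fed back through VectorSource:

```python
def split_document(text, delimiter="\n\n"):
    """Split one document into several, one per delimiter-separated chunk."""
    return [chunk.strip() for chunk in text.split(delimiter) if chunk.strip()]

doc = "First article text.\n\nSecond article text.\n\nThird article text."
print(split_document(doc))
# ['First article text.', 'Second article text.', 'Third article text.']
```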

Split by elements of a string, and create a dictionary with {element used to split: that chunk of text}

Consider the following text: "Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?" And a list of words to split on: ["McCONNELL", "PRESIDING OFFICER", "REID"]. I want the output to be the dictionary {"McCONNELL":...
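A minimal sketch of this split, assuming the "Mr."/"The" title prefixes seen in the sample text and concatenating the chunks of speakers who talk more than once:

```python
import re

def split_by_speakers(text, speakers):
    """Map each speaker name to the text spoken after their cue."""
    # One alternation over the names; longer names first so
    # "PRESIDING OFFICER" is never shadowed by a shorter match
    pattern = "|".join(sorted(map(re.escape, speakers), key=len, reverse=True))
    pieces = re.split(r"(?:Mr\.|The)\s+(" + pattern + r")\.", text)
    result = {}
    # pieces alternates: [preamble, name, chunk, name, chunk, ...]
    for name, chunk in zip(pieces[1::2], pieces[2::2]):
        result[name] = result.get(name, "") + chunk.strip() + " "
    return {name: chunk.strip() for name, chunk in result.items()}

text = ("Mr. McCONNELL. yadda yadda jon stewart is mean to me. "
        "The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. "
        "Mr. REID. Really dude?")
speakers = ["McCONNELL", "PRESIDING OFFICER", "REID"]
print(split_by_speakers(text, speakers))
```

Using a single re.split with a capturing group keeps the speaker names in the output, so the chunks pair up with the name that introduced them.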

Transporting Sparse Matrix from Python to R

I am doing some text analysis work in Python. Unfortunately, I need to switch to R in order to use a particular package (the package cannot easily be replicated in Python). Currently the text is parsed into bigram counts, reduced to a vocabulary of about 11,000 bigrams, and then...
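One common route for this hand-off is the plain-text MatrixMarket format, which SciPy writes and R's Matrix package reads. A sketch of the Python side (the toy matrix stands in for the ~11,000-bigram count matrix):

```python
from scipy import sparse
from scipy.io import mmread, mmwrite

# Toy stand-in for the bigram count matrix in the question
counts = sparse.csr_matrix([[0, 2, 0], [1, 0, 3]])

# MatrixMarket is a plain-text interchange format; on the R side the
# file can be loaded with Matrix::readMM("bigram_counts.mtx")
mmwrite("bigram_counts.mtx", counts)

# Round-trip check on the Python side
restored = mmread("bigram_counts.mtx").tocsr()
print((restored != counts).nnz == 0)  # True
```

Because the format only stores nonzero entries, the file stays small for a sparse count matrix, and no binary compatibility between the two runtimes is needed.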

Favorite tool for word/phrase counting

I am looking for a tool that performs counting of words and, more importantly, phrases, in large amounts of open-ended text responses. I also need the ability to exclude certain words (a, the, and, etc.). I am aware of a few tools that do this:
- http://www.mywritertools.com/default.asp
- http://www.hermetic.ch/wfca/wfca.htm
...
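If a few lines of code are acceptable instead of a packaged tool, the core of this task — phrase counting with a stopword exclusion list — fits in a short Python sketch (the stopword set and sample responses are placeholders):

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "and", "of", "to"}  # extend as needed

def count_phrases(responses, n=2):
    """Count n-word phrases across responses, after dropping stopwords."""
    counts = Counter()
    for response in responses:
        words = [w for w in re.findall(r"[a-z']+", response.lower())
                 if w not in STOPWORDS]
        # Slide a window of n words over the filtered token stream
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts

responses = ["the service was slow", "service was slow and rude"]
print(count_phrases(responses).most_common(3))
```

Setting n=1 gives plain word counts; larger n gives longer phrases, at the cost of sparser counts.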

Cutting down on Stanford parser's time-to-parse by pruning the sentence

We already know that the Stanford Parser's parsing time increases with sentence length. I am interested in finding creative ways to prune a sentence so that parsing time decreases without compromising accuracy. For example, we can replace known noun phrases...
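The noun-phrase replacement idea can be prototyped with simple string substitution before the sentence ever reaches the parser; the phrase list here is hypothetical and would in practice come from a chunker or gazetteer:

```python
import re

# Known noun phrases to collapse before parsing (hypothetical list)
KNOWN_NPS = ["the United States Senate", "a joint congressional committee"]

def prune(sentence, placeholder="NP"):
    """Replace each known noun phrase with a single placeholder token,
    shortening the sentence the parser has to handle."""
    for np in sorted(KNOWN_NPS, key=len, reverse=True):
        sentence = re.sub(re.escape(np), placeholder, sentence)
    return sentence

s = "Yesterday the United States Senate formed a joint congressional committee."
print(prune(s))  # Yesterday NP formed NP.
```

After parsing, the placeholder subtrees can be expanded back to the original phrases; the open question is how much this substitution perturbs the parse of the surrounding material.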

Identifying tabbed-in sections in raw text

Consider the text on this page. If you look at the source code, you'll see that the main text is presented exactly as on the page -- no HTML divisions or any other obvious way to find paragraphs or tabbed-in sections. Is there a way to automatically identify and remove sections...
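When the only structural cue is leading whitespace, a line-by-line filter is often enough. A minimal sketch, assuming indented sections start with a tab or a run of spaces:

```python
def strip_indented_sections(text):
    """Drop lines that are tabbed or space-indented, keeping flush-left prose."""
    kept = []
    for line in text.splitlines():
        # An indented line starts with a tab or at least four spaces
        if line.startswith(("\t", "    ")):
            continue
        kept.append(line)
    return "\n".join(kept)

raw = "Main paragraph text.\n\tIndented aside to remove.\nMore main text."
print(strip_indented_sections(raw))
```

The inverse filter (keep only the indented lines) identifies the tabbed-in sections instead of removing them.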

Retrieve code executed by function in Java

I'm trying to analyse some bits of Java code, checking whether the code is written too complexly. I start with a String containing the contents of a Java class. From there, given a function name, I want to retrieve the "inner code" of that function. In this example: public class testClass{ public int...
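A real solution would use a Java parser, but the extraction itself can be sketched with brace counting (in Python here for brevity); note this sketch ignores braces inside string literals and comments:

```python
def method_body(java_source, method_name):
    """Return the brace-delimited body of the named method, or None."""
    idx = java_source.find(method_name + "(")
    if idx == -1:
        return None
    start = java_source.index("{", idx)
    depth = 0
    for i in range(start, len(java_source)):
        if java_source[i] == "{":
            depth += 1
        elif java_source[i] == "}":
            depth -= 1
            if depth == 0:          # matching close brace found
                return java_source[start + 1:i]
    return None  # unbalanced braces

src = "public class testClass{ public int foo(){ return bar(); } }"
print(method_body(src, "foo"))  # " return bar(); "
```

For anything beyond a prototype, a proper parser such as javaparser or Eclipse JDT avoids the string/comment pitfalls.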

Stanford Parser - Factored model and PCFG

What is the difference between the factored and PCFG models of the Stanford Parser, in terms of how they work theoretically and mathematically?

Performing Text Analytics on a text Column in Dataframe in R [closed]

I have imported a CSV file into a data frame in R, and one of the columns contains text. I want to perform analysis on that text. How do I go about it? I tried making a new data frame containing only the text column: OnlyTXT = Txtanalytics1 %>% select(problem_note_text); View(OnlyTXT). ...
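The dplyr step in the question has a direct pandas counterpart, shown here with a first analysis pass added; the column name problem_note_text comes from the question, while the toy data is invented:

```python
import pandas as pd

# Toy stand-in for the imported CSV
df = pd.DataFrame({"problem_note_text": ["Server down again", "Disk full"],
                   "ticket_id": [1, 2]})

# Equivalent of: OnlyTXT = Txtanalytics1 %>% select(problem_note_text)
only_txt = df[["problem_note_text"]]

# A first analysis step: word counts per note
only_txt = only_txt.assign(
    n_words=only_txt["problem_note_text"].str.split().str.len())
print(only_txt)
```

From here, tokenization, stopword removal, and term counting can all be applied column-wise with the same .str accessor pattern.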