

Conditional jump or move depends on uninitialised value(s) strcat

c,racket,tokenize
I understand that this valgrind error occurred because I was trying to use something uninitialized. The code below is the one that causes this error. What it does is try to read Racket code and get each symbol, such as + or define (tokenize). I am not expecting...

Tokenized output of C source code

c,compiler-construction,tokenize
I want to look at the tokenized output of my C source code. The preprocessor first processes the cpp directives, and then the C source code is tokenized. This tokenized output is then parsed. After that the assembler does its job and the process continues. I have written my tokenizer using flex. I...

Using python rdflib parsers without the graph object

python,parsing,tokenize,rdflib
Loading RDF data in Python looks like this: from rdflib import Graph g = Graph() g.parse("demo.nt", format="nt") But what about using the format parsers standalone as streaming parsers, getting a stream of parsed tokens? Can someone give me a hint/code example?...
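A possible approach (a sketch only, not verified against the current rdflib API): the N-Triples parser can be driven with a custom "sink" object instead of a Graph, so each triple is handed to a callback as soon as it is parsed. The module path and class name used below (rdflib.plugins.parsers.ntriples.NTriplesParser) are assumptions and have changed between rdflib versions.

    # Assumption: NTriplesParser accepts a sink object with a triple() method;
    # newer rdflib releases may expose this class under a different name.
    from rdflib.plugins.parsers.ntriples import NTriplesParser

    class PrintSink:
        def triple(self, s, p, o):
            # called once per parsed triple, no Graph is built
            print(s, p, o)

    parser = NTriplesParser(sink=PrintSink())
    with open("demo.nt", "rb") as f:
        parser.parse(f)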

CountVectorizer in sklearn with only words above some minimum number of occurrences

python,text,scikit-learn,tokenize
I am using sklearn to train a logistic regression on some text data, by using CountVectorizer to tokenize the data into bigrams. I use a line of code like the one below: vect= CountVectorizer(ngram_range=(1,2), binary =True) However, I'd like to limit myself to only including bigrams in my resultant sparse...
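For reference, a minimal sketch of the usual way to express this with CountVectorizer: the min_df parameter drops any n-gram that appears in fewer than the given number of documents (note this is document frequency, not total occurrence count).

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the cat ran", "a dog ran fast"]
    # keep only unigrams/bigrams that occur in at least 2 documents
    vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=2)
    X = vect.fit_transform(docs)
    print(sorted(vect.vocabulary_))   # ['cat', 'ran', 'the', 'the cat']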

nltk sentence tokenizer, consider new lines as sentence boundary

python,nlp,nltk,tokenize
I am using nltk's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to treat a new paragraph or new lines as a new sentence. >>> from nltk.tokenize.punkt import PunktSentenceTokenizer >>> tokenizer = PunktSentenceTokenizer() >>> tokenizer.tokenize('Sentence 1 \n Sentence 2. Sentence 3.') ['Sentence 1 \n...
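One common workaround (a minimal sketch, assuming newlines should act as hard boundaries): split the text on line breaks first, then run the Punkt tokenizer on each piece, so a line break always ends a sentence.

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    def sentences_with_newline_boundaries(text):
        tokenizer = PunktSentenceTokenizer()
        sentences = []
        for line in text.splitlines():      # newlines become hard boundaries
            line = line.strip()
            if line:
                sentences.extend(tokenizer.tokenize(line))
        return sentences

    print(sentences_with_newline_boundaries('Sentence 1 \n Sentence 2. Sentence 3.'))
    # ['Sentence 1', 'Sentence 2.', 'Sentence 3.']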

How to make a program work on different text files in C

c,performance,tokenize
I am trying to work with command line arguments and parsing a text file in C. Basically I want to be able to put in two numbers, like 1 and 4, and have it read a column of a text file and then print it to stdout. I want to be...

Why is my vector empty?

c++,c++11,tokenize
I want to create a simple inverted index. I have a file with docIds and the keywords that are in each document. So the first step is to try to read the file and tokenize the text file. I found a tokenize function online that was supposed to work and...

StanfordNLP Tokenizer

tokenize,stanford-nlp,misspelling
I use StanfordNLP to tokenize a set of messages written on smartphones. These texts have a lot of typos and do not respect punctuation rules. Very often blank spaces are missing, which affects the tokenization. For instance, the following sentence is missing the blank space in "California.This" and "university,founded": Stanford University...

NLTK PunktSentenceTokenizer ellipsis splitting

python,python-2.7,nltk,tokenize
I'm working with NLTK's PunktSentenceTokenizer and I'm facing a situation where a text contains multiple sentences separated by the ellipsis character (...). Here is the example I'm working on: >>> from nltk.tokenize import PunktSentenceTokenizer >>> pst = PunktSentenceTokenizer() >>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours......
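One possible workaround (a hedged sketch, not a Punkt-native solution): pre-split the text after each ellipsis with a regex, then let PunktSentenceTokenizer handle the ordinary boundaries inside each fragment.

    import re
    from nltk.tokenize import PunktSentenceTokenizer

    text = "Horrible customer service... Cashier was rude... Drive thru took hours..."
    pst = PunktSentenceTokenizer()
    sentences = []
    for chunk in re.split(r"(?<=\.\.\.)\s+", text):   # cut after each "..."
        sentences.extend(pst.tokenize(chunk))
    print(sentences)
    # ['Horrible customer service...', 'Cashier was rude...', 'Drive thru took hours...']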

word_delimiter with split_on_numerics removes all tokens

elasticsearch,tokenize
When analyzing alpha 1a beta, I want the resulting tokens to be [alpha 1 a beta]. Why does myAnalyzer not do the trick? POST myindex { "settings" : { "analysis" : { "analyzer" : { "myAnalyzer" : { "type" : "custom", "tokenizer" : "standard", "filter" : [ "split_on_numerics" ]...

C++ Tokenize String - Not Working

c++,string,tokenize
I am having trouble tokenizing a string in order to add the substrings to vectors in an iterative loop. I have this below. When I run it, I am getting a return value of 1 from this function call, which I'm pretty sure is not accurate. serialized.find_first_of(categoryDelim, outterPrev) Code void...

Tokenizer plugin with autocomplete. Am I implementing it properly?

jquery,autocomplete,tokenize
I am using the plugin found here - http://www.jqueryscript.net/form/Simple-jQuery-Tagging-Tokenizer-Input-with-Autocomplete-Tokens.html github - https://github.com/firstandthird/tokens dependencies - https://github.com/jgallen23/fidel Example uses: 1) Working example (source array is loaded in memory) - http://jsfiddle.net/george_black/brmbyL8x/ js code: (function () { $('#tokens-example').tokens({ source: ['Acura', 'Audi', 'BMW', 'Cadillac', 'Chrysler', 'Dodge', 'Ferrari', 'Ford', 'GMC', 'Honda', 'Hyundai', 'Infiniti', 'Jeep', 'Kia', 'Lexus',...

Java - Tokenizing by regex

java,regex,tokenize,stringtokenizer
I'm trying to tokenize strings of the following format: "98, BA71V-CP204L (p32, p30), BA71V-CP204L (p32, p30), , 0, 125900, 126505" "91, BA71V-B175L, BA71V-B175L, , 0, 108467, 108994, -, 528, 528" Each of the tokens will then be stored in a string array. The strings are to be tokenized by ","...

Java StreamTokenizer splits email address at @ sign

java,email,stream,tokenize
I am trying to parse a document containing email addresses, but the StreamTokenizer splits the email address into two separate parts. I already set the @ sign as an ordinaryChar and space as the only whitespace: StreamTokenizer tokeziner = new StreamTokenizer(freader); tokeziner.ordinaryChar('@'); tokeziner.whitespaceChars(' ', ' '); Still, all email addresses...

Efficient Way to Tokenize String With Complex Delimiter / Separator and Preserving the Delimiter / Separator as Token in C#

c#,string,tokenize
I am trying to find the most efficient way to create a generic tokenizer that will retain the complex delimiters / separators as extra token. And yes... I looked at some SO questions like How can i use string#split to split a string with the delimiters + - * /...

Only the first character of token is stored in array in c++

c++,arrays,tokenize
int main(void) { char *text = (char*)malloc ( 100 *sizeof( char)); cout << "Enter the first arrangement of data." << endl; cin.getline(text, 100); char *token = strtok(text, " "); char *data = (char*)malloc ( 100*sizeof( char)); while ( token != NULL ) { if (strlen(token) > 0) { cout...

elasticsearch custom tokenizer - split token by length

elasticsearch,lucene,tokenize,stringtokenizer,analyzer
I am using elasticsearch version 1.2.1. I have a use case in which I would like to create a custom tokenizer that will break the tokens by their length up to a certain minimum length. For example, assuming minimum length is 4, the token "abcdefghij" will be split into: "abcd...

ElasticSearch search for special characters with pattern analyzer

elasticsearch,tokenize,query-analyzer
I'm currently using a custom analyzer with the tokenizer set to the pattern (\W|_)+ so that each term is only letters and is split on any non-letter. As an example I have a document with the contents [dbo].[Material_Get] and another with dbo.Another_Material_Get. I want to be able to search...

String Tokenizer (Double Quotes and Whitespace)

java,string,tokenize
I am trying to implement a way to take in arguments for a photo album that I am building. However, I am having a hard time figuring out how to tokenize the input. Two sample inputs: addPhoto "DSC_017.jpg" "DSC_017" "Fall colors" addPhoto "DSC_018.jpg" "DSC_018" "Colorado Springs" I would like this...

StreamReader row and line delimiters

c#,arrays,tokenize,delimiter,streamreader
I am trying to figure out how to tokenize a StreamReader of a text file. I have been able to separate the lines, but now I am trying to figure out how to break down those lines by a tab delimiter as well. This is what I have so far....

Tokenizing a String - C

c,tokenize
I'm trying to tokenize a string in C based upon \r\n delimiters, and want to print out each string after subsequent calls to strtok(). In the while loop I have, some processing is done on each token. When I include the processing code, the only output I receive is the...

What is an efficient data structure for tokenized data in Python?

python,performance,text,pandas,tokenize
I have a pandas dataframe that has a column with some text. I want to modify the dataframe such that there is a column for every distinct word that occurs across all rows, and a boolean indicating whether or not that word occurs in that particular row's value in my...
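For a small vocabulary, one straightforward sketch uses pandas' Series.str.get_dummies to create one boolean column per distinct word; for large vocabularies a sparse representation such as scikit-learn's CountVectorizer with binary=True is usually a better fit.

    import pandas as pd

    df = pd.DataFrame({"text": ["red green", "green blue", "blue"]})
    # one indicator column per distinct whitespace-separated word
    word_flags = df["text"].str.get_dummies(sep=" ").astype(bool)
    result = pd.concat([df, word_flags], axis=1)
    print(result)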

How to remove a custom word pattern from a text using NLTK with Python

python,regex,nlp,nltk,tokenize
I am currently working on a project analyzing the quality of examination paper questions. Here I am using Python 3.4 with NLTK. First I want to take each question out of the text separately. The question paper format is given below. (Q1). What is web 3.0? (Q2). Explain about blogs....
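A hedged sketch of that first step: since the questions are prefixed with markers like "(Q1).", a regular expression split on that pattern pulls each question out before any NLTK tokenization; the exact pattern is an assumption based on the format shown.

    import re

    paper = "(Q1). What is web 3.0? (Q2). Explain about blogs."
    parts = re.split(r"\(Q\d+\)\.\s*", paper)          # cut at each "(Qn)." marker
    questions = [p.strip() for p in parts if p.strip()]
    print(questions)   # ['What is web 3.0?', 'Explain about blogs.']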

R - Tokenization - single and two letter words in a TermDocumentMatrix

r,nlp,tokenize,tm
I am currently trying to do a little bit of text processing and I would like to get the one- and two-letter words in a TermDocumentMatrix. The issue is that it seems to display only words of 3 letters or more. library(tm) library(RWeka) test<-'This is a test.' testmyCorpus<-Corpus(VectorSource(test)) testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))...

how to write a lex file for input like "{\"a\":1,\"b\":2}"

json,parsing,racket,tokenize,lex
I want to implement a JSON parser, but I am having a problem parsing objects like "{\"a\":1,\"b\":2}". Currently the parser outputs something like this '(json (object "{" (kvpair "\"a\":1,\"b\"" ":" (json (number "2"))) "}")) but what I actually want is '(json (object "{" (kvpair "\"a\"" ":" (json (number "1"))) "," (kvpair "\"b\""...

Best way to parse a custom filter syntax

c#,parsing,filter,tokenize
I have a program which allows the user to enter a filter in a textbox in the column header of a DataGridView. This text is then parsed into a list of FilterOperations. Currently I tokenize the string and then build the list in a huge for-loop. Which design patterns could...

StanfordNLP Spanish Tokenizer

tokenize,stanford-nlp
I want to tokenize a text in Spanish with StanfordNLP, and my problem is that the model splits any word matching the pattern "\d*s " (a word composed of digits and ending with an "s") into two tokens. If the word finishes with another letter, such as "e", the tokenizer...

Elasticsearch "pattern_replace", replacing whitespaces while analyzing

elasticsearch,whitespace,tokenize,removing-whitespace
Basically I want to remove all whitespaces and tokenize the whole string as a single token. (I will use nGram on top of that later on.) This is my index settings: "settings": { "index": { "analysis": { "filter": { "whitespace_remove": { "type": "pattern_replace", "pattern": " ", "replacement": "" } },...

Elastic Search: Configuring icu_tokenizer for Czech characters

unicode,elasticsearch,lucene,tokenize,icu
The icu_tokenizer in elasticsearch seems to break a word into segments when it encounters accented characters such as Č, and it also returns strange numeric tokens. Example GET /_analyze?text=OBČERSTVENÍ&tokenizer=icu_tokenizer returns "tokens": [ { "token": "OB", "start_offset": 0, "end_offset": 2, "type": "<ALPHANUM>", "position": 1 }, { "token": "268", "start_offset": 4, "end_offset": 7,...

JavaCC: Matching an empty string

java,regex,tokenize,javacc
I am having trouble with ambiguous tokens. My grammar defines two productions, a numeric constant of the form 2e3 or 100e1, and identifiers of the form abc or uvw123. The problem is that e1 is a valid identifier, but also constitutes part of a numeric constant. So for example, if...

Text tokenization with Stanford NLP : Filter unrequired words and characters

java,machine-learning,tokenize,stanford-nlp
I use Stanford NLP for string tokenization in my classification tool. I want to get only meaningful words, but I'm getting non-word tokens (like ---, >, ., etc.) and unimportant words like am, is, to (stop words). Does anybody know a way to solve this problem?

ElasticSearch Analyzer and Tokenizer for Emails

email,elasticsearch,lucene,tokenize,analyzer
I could not find a perfect solution either in Google or ES for the following situation, hope someone could help here. Suppose there are five email addresses stored under field "email": 1. {"email": "[email protected]"} 2. {"email": "[email protected], [email protected]"} 3. {"email": "[email protected]"} 4. {"email": "[email protected]} 5. {"email": "[email protected]"} I want to...

Writing an expression to recursively extract data between parentheses

php,regex,recursion,tokenize
I'm trying to write a regular expression to split a string into separate elements inside matching curly braces. First off, it needs to be recursive, and second off, it has to return the offsets (like with PREG_OFFSET_CAPTURE). I actually think this is probably a less efficient way to process this...

Chinese sentence segmenter with Stanford coreNLP

java,nlp,tokenize,stanford-nlp
I'm using the Stanford CoreNLP system with the following command: java -cp stanford-corenlp-3.5.2.jar:stanford-chinese-corenlp-2015-04-20-models.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt And this is working great on small Chinese texts. However, I need to train an MT system which just requires me to segment my input. So I just need...

using strtok function to tokenize a sentence

c++,token,tokenize,strtok
I am having a little trouble using the strtok() function. What I am trying to do is grab a sentence from the keyboard, then create tokens for every space in the sentence, and finally print every word that is separated by a space. My current output is blank...

How to pinpoint and use the last token per line in FOR /F Batch file

csv,batch-file,token,tokenize
I am using FOR /F to read the lines of a .csv file to perform XCOPY of various files from one location to another. The columns in the .csv file contain the information of the source and destination folders and filenames. COL1 COL2 COL3 COL4 COL5 1234 From1 Out1 Out2...

Choose formats in sscanf in c

c,string,tokenize,sscanf
I am trying to parse the string Connected to a:b:c:d completed (reauth) id=5 using sscanf() in C. My format string is Connected to %s completed %s id=%s. But in some cases my string is Connected to a:b:c:d completed id=5. I am not getting that reauth part. I am able...

How to tokenize string by delimiters?

c++,c++11,boost,tokenize,boost-tokenizer
I need to tokenize a string by delimiters. For example: For "One, Two Three,,, Four" I need to get {"One", "Two", "Three", "Four"}. I am attempting to use this solution https://stackoverflow.com/a/55680/1034253 std::vector<std::string> strToArray(const std::string &str, const std::string &delimiters = " ,") { boost::char_separator<char> sep(delimiters.c_str()); boost::tokenizer<boost::char_separator<char>> tokens(str.c_str(), sep); std::vector<std::string> result; for (const...

Need to split string based on delimiters , but those are grouped

java,regex,string,parsing,tokenize
I have a string like String str = "(3456,"hello", world, {ok{fub=100, fet = 400, sub="true"}, null }, got, cab[{m,r,t}{u,u,r,}{r,m,"null"}], {y,i,oft{f,f,f,f,}, tu, yu, iu}, null, null) Now I need to split this string based on commas (,), but the parts between {} and [] should not be split. So my...

Lucene Analyzer tokenizer for substring search

java,lucene,tokenize,analyzer
I need a Lucene Tokenizer that can do the following. Given the string "wines bottle caps", the following queries should succeed: wine bott cap ottl aps wine bottl. Here is what I have so far. How might I modify it to work? No query shorter than three characters should work....

Trouble tokenizing for binary tree

c,binary-tree,tokenize
I am trying to tokenize a textfile and then put the tokens in a binary tree where the token that has a lower value goes on the left branch of the tree and the token that has a higher value goes to the right and repeated values have an updated...

How to add phrase as a stopword while using lucene analyzer?

java,lucene,tokenize
I am using the Lucene 4.6.1 libraries. I am trying to add the phrase "hip hop" to my stopword exclusion list. I can exclude it if it's written as "hiphop" (one word), but when it's written as "hip hop" (with a space in between) I cannot exclude it. Below...

How do I use NLTK's default tokenizer to get spans instead of strings?

python,nltk,tokenize
NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on each sentence. It does a pretty good job out of the box. >>> nltk.word_tokenize("(Dr. Edwards is my friend.)") ['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')'] I'd like to use this same algorithm...
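One way to recover spans (a minimal sketch, assuming each token appears verbatim in the original text; Treebank-style rewrites such as a straight double quote becoming `` would need extra handling): run nltk.word_tokenize and re-align each token against the source string to get (start, end) offsets.

    import nltk

    def word_spans(text):
        spans = []
        offset = 0
        for tok in nltk.word_tokenize(text):
            start = text.find(tok, offset)
            if start == -1:        # token was rewritten by the tokenizer; skip it
                continue
            end = start + len(tok)
            spans.append((start, end))
            offset = end
        return spans

    text = "(Dr. Edwards is my friend.)"
    print([(s, e, text[s:e]) for s, e in word_spans(text)])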