
Where can I find a corpus of search engine queries?


Tags: nlp, search-engine, google-search, bing

I'm interested in training a question-answering system on top of user-generated search queries, but so far it looks like such data is not made publicly available. Are there any research centers or industry labs that have compiled corpora of search-engine queries?


There are a couple of datasets like this:

Yahoo Webscope

Yandex datasets: released as part of a Kaggle competition. You can sign up and download them.

There are also the AOL Query Logs and the MSN Query Logs, which were released as part of shared tasks over the past ten years. I'm not sure whether they are still public, but they are worth exploring.
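Once you get hold of one of these logs, a first step is usually to parse it and look at query frequencies. As a minimal sketch (assuming a tab-separated layout with `AnonID`, `Query`, `QueryTime`, `ItemRank`, `ClickURL` columns, roughly the format the AOL logs were distributed in — check the README of whichever dataset you obtain), something like this works:

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in an AOL-style tab-separated query log.
sample = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "142\tweather boston\t2006-03-01 07:17:12\t1\thttp://example.com\n"
    "142\tweather boston\t2006-03-02 08:01:09\t\t\n"
    "217\tnlp corpora\t2006-03-04 11:14:32\t2\thttp://example.org\n"
)

# Parse the log; for a real file, pass an open file handle instead of StringIO.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
queries = [row["Query"] for row in reader]

# Count distinct queries -- a starting point for building a QA training set.
counts = Counter(queries)
print(counts.most_common(1))  # -> [('weather boston', 2)]
```

From there you can filter question-like queries (e.g. ones starting with "how", "what", "why") to get closer to QA-style training data.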



