

Where can I find a corpus of search engine queries?

Question:

Tags: nlp,search-engine,google-search,bing

I'm interested in training a question-answering system on top of user-generated search queries, but so far it looks like such data is not made publicly available. Are there any research centers or industry labs that have compiled corpora of search-engine queries?


Answer:

There are a few datasets like this:

Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php?datatype=l

Yandex datasets: https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data (part of a Kaggle challenge; you can sign up and download the data).

There are also the AOL and MSN query logs, which were published as part of shared tasks over the past ten years. I'm not sure whether they are still public, but they are worth exploring.
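
If you do track down one of these logs, they are usually plain tab-separated text. A minimal Python sketch for streaming queries out of an AOL-style log, assuming the commonly described AnonID/Query/QueryTime/ItemRank/ClickURL column layout (adjust the field names and file name for other logs):

import csv

def read_query_log(path):
    # Stream one query per row; these files are large, so avoid readlines().
    with open(path, newline='', encoding='utf-8', errors='replace') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            query = (row.get('Query') or '').strip()
            if query:
                yield query

# e.g. collect unique queries as training data:
# queries = set(read_query_log('queries.tsv'))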


Related:


Counting words in list using a dictionary


python,nlp
I have a list of dictionaries containing a word and some misspellings of the word. I am trying to go through a list of strings and first count the occurrences of the word, and then count the occurrences of each misspelling. I have tried using if word in string...
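
A minimal sketch of one way to do the counting, assuming the data shapes described in the question; the keys 'word' and 'misspellings' are made up for illustration:

from collections import Counter

# Made-up data shapes for illustration, matching the question's description.
entries = [{'word': 'receive', 'misspellings': ['recieve', 'receeve']}]
texts = ['I will recieve the parcel', 'Did you receive it?']

counts = Counter()
for text in texts:
    tokens = text.lower().split()
    for entry in entries:
        counts[entry['word']] += tokens.count(entry['word'])
        for miss in entry['misspellings']:
            counts[miss] += tokens.count(miss)

print(counts)  # Counter({'receive': 1, 'recieve': 1, 'receeve': 0})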

Python NLTK - Making a 'Dictionary' from a Corpus and Saving the Number Tags


python,nlp,nltk,corpus,tagged-corpus
I'm not super experienced with Python, but I want to do some data analytics with a corpus, so I'm doing that part in NLTK for Python. I want to go through the entire corpus and make a dictionary containing every word that appears in the corpus dataset. I want to be...
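
A short sketch of the idea, using NLTK's Brown corpus as a stand-in for the asker's dataset; the saved JSON file name is arbitrary:

import json
from collections import Counter
from nltk.corpus import brown  # substitute your own corpus reader
# nltk.download('brown') may be needed on first run

# Count every word in the corpus, then save the "dictionary" to disk.
word_counts = Counter(w.lower() for w in brown.words())
with open('word_counts.json', 'w') as f:
    json.dump(word_counts, f)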

Save and reuse TfidfVectorizer in scikit learn


python,nlp,scikit-learn,pickle,text-mining
I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object to reuse it later. I tried to use pickle, but it gave the following error. loc=open('vectorizer.obj','w') pickle.dump(self.vectorizer,loc) *** TypeError: can't pickle instancemethod objects I tried using joblib in sklearn.externals,...
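
For what it's worth, two likely fixes: pickle needs a binary file mode ('wb', not 'w'), and joblib generally handles fitted scikit-learn objects well. A sketch, assuming the standalone joblib package (older scikit-learn versions expose it as sklearn.externals.joblib):

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(['first document', 'second document'])

joblib.dump(vectorizer, 'vectorizer.joblib')  # serialize the fitted object
restored = joblib.load('vectorizer.joblib')   # ...and reuse it later
matrix = restored.transform(['another document'])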

What is the default behavior of Stanford NLP's WordsToSentencesAnnotator when splitting a text into sentences?


nlp,stanford-nlp
Looking at WordToSentenceProcessor.java, DEFAULT_BOUNDARY_REGEX = "\\.|[!?]+"; led me to think that the text would get split into sentences based on ., ! and ?. However, if I pass the string D R E L I N. Okay. as input, e.g. using the command line interface: java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP...

Parsing multiple sentences with MaltParser using NLTK


java,python,parsing,nlp,nltk
There have been many MaltParser and/or NLTK related questions: Malt Parser throwing class not found exception How to use malt parser in python nltk MaltParser Not Working in Python NLTK NLTK MaltParser won't parse Dependency parser using NLTK and MaltParser Dependency Parsing using MaltParser and NLTK Parsing with MaltParser engmalt...

NLP - Sentiment Processing for Junk Data takes time


nlp,stanford-nlp,sentiment-analysis,pos-tagger
I am trying to find the sentiment for the input text. This text is a junk sentence, and when I tried to find the sentiment, the Annotation to parse the sentence took around 30 seconds. For normal text it takes less than a second. If I need to process...

How does trec_eval calculate Mean Average Precision (MAP)?


search-engine,information-retrieval,data-retrieval
I'm using TREC_EVAL to evaluate a search engine. I'd like to know how it calculates the Mean Average Precision (MAP). I'm sure it doesn't calculate a simple average of the average precisions (AP). It seems to be a weighted average, but I can't understand which weights are used.
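
For what it's worth, the usual description of trec_eval's MAP: for one query, AP is the mean of the precision at each rank where a relevant document is retrieved, divided by the total number of relevant documents (unretrieved relevant documents therefore count as zero); MAP is then the plain mean of AP over the judged queries only, which can make it look weighted compared with an average over all queries in the run. A sketch of that computation (function and variable names are mine):

def average_precision(ranking, relevant):
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    # divide by ALL relevant docs, not just the retrieved ones
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    # plain mean of AP over the queries that have relevance judgments
    judged = [q for q in qrels if qrels[q]]
    return sum(average_precision(runs.get(q, []), qrels[q])
               for q in judged) / len(judged)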

Stanford Entity Recognizer (caseless) in Python Nltk


python,nlp,nltk
I am trying to figure out how to use the caseless version of the entity recognizer from NLTK. I downloaded http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip and placed it in the site-packages folder of Python. Then I downloaded http://nlp.stanford.edu/software/stanford-corenlp-caseless-2015-04-20-models.jar and placed it in the folder. Then I ran this code in NLTK from nltk.tag.stanford import...

How to split a sentence in Python?


python-3.x,nlp
I need to isolate every single word of a long, natural text in Python 3. What is the most efficient way to do this?...
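
A few common options, from simplest to most linguistically aware (word_tokenize needs NLTK's punkt data):

import re

text = "I need to isolate every single word of a long, natural text."

words = text.split()                     # whitespace only; punctuation stays attached
words = re.findall(r"[A-Za-z']+", text)  # letters and apostrophes only
# from nltk.tokenize import word_tokenize  # handles clitics like "don't"
# words = word_tokenize(text)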

Elasticsearch two sets of terms against two fields


elasticsearch,search-engine
I'm trying to use Elasticsearch to return docs that have different terms in two fields. I don't know how to write this, but it would be something like: query: field1: "term set #1" field2: "very different term set #2" Ideally the term sets would be arrays of strings. I'd like all...
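
A sketch of what this might look like in the Elasticsearch query DSL: a bool query with one terms clause per field. The index, field names and values are placeholders:

query = {
    "query": {
        "bool": {
            "must": [
                {"terms": {"field1": ["apple", "banana"]}},  # term set #1
                {"terms": {"field2": ["red", "yellow"]}},    # term set #2
            ]
        }
    }
}
# POST this body to /<index>/_search, e.g. with the elasticsearch-py client.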

Handling count of characters with diacritics in R


r,unicode,character-encoding,nlp,linguistics
I'm trying to get the number of characters in strings with characters with diacritics, but I can't manage to get the right result. > x <- "n̥ala" > nchar(x) [1] 5 What I want to get is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be...
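
The underlying fix is to count grapheme clusters (a base character plus its combining marks) rather than code points. The question is about R, but here is the same logic sketched in Python with unicodedata; in R, the stringi package's grapheme-boundary counting is the usual analogue:

import unicodedata

def n_graphemes(s):
    # Decompose, then count only non-combining code points, so a base
    # letter plus its diacritic ("n" + combining ring below) counts once.
    return sum(1 for ch in unicodedata.normalize('NFD', s)
               if not unicodedata.combining(ch))

print(n_graphemes("n̥ala"))  # 4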

Tabulating characters with diacritics in R


r,unicode,nlp,linguistics
I'm trying to tabulate phone (character) occurrences in a string, but diacritics are tabulated as characters in their own right. Ideally, I have a wordlist in the International Phonetic Alphabet, with a fair number of diacritics and several combinations of them with base characters. I give here a MWE with just one...

NLP: Arrange words with tags into proper English sentence?


nlp
Let's say I have a sentence: "you hello how are ?" I get the output: you_PRP hello_VBP how_WRB are_VBP What is the best way to arrange the words into a proper English sentence, like: Hello how are you ? I am new to this whole natural language processing, so I am unfamiliar...

What exactly is the difference between AnalysisEngine and CAS Consumer?


nlp,uima
I'm learning UIMA, and I can create basic analysis engines and get results. But what I'm finding difficult to understand is the use of CAS consumers. At the same time, I want to know how different a CAS consumer is from an AnalysisEngine. From many examples I have seen, a CAS consumer is not...

Stanford CoreNLP: can a word in a sentence be part of multiple coreference chains


nlp,stanford-nlp
The question is in the title. Using Stanford's NLP coref module, I am wondering if a given word can be part of multiple coreference chains, or whether it can only be part of one chain. Could you give me examples of when this might occur? Similarly, can a word be part...

How to remove a custom word pattern from a text using NLTK with Python


python,regex,nlp,nltk,tokenize
I am currently working on a project of analyzing the quality of examination paper questions. Here I am using Python 3.4 with NLTK. First I want to take out each question separately from the text. The question paper format is given below. (Q1). What is web 3.0? (Q2). Explain about blogs....
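
Given the (Qn). markers shown, a regex split is a simple starting point; a sketch (the dict comprehension pairs each label with its question text):

import re

paper = "(Q1). What is web 3.0? (Q2). Explain about blogs."

# The capturing group makes re.split keep the markers.
parts = re.split(r'\((Q\d+)\)\.', paper)
questions = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}
print(questions)  # {'Q1': 'What is web 3.0?', 'Q2': 'Explain about blogs.'}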

The ±2 window in word similarity in NLP


vector,nlp,distribution
There is a question illustrated below: //--------question start--------------------- Consider the following small corpus consisting of three sentences: The judge struck the gavel to silence the court. Buying the cheap saw is false economy. The nail was driven in when the hammer struck it hard. Use distributional similarity to determine whether...
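
The usual recipe for this kind of exercise: collect, for each target word, the words appearing within two positions of it, then compare the resulting count vectors with cosine similarity. A self-contained sketch over the corpus from the question:

from collections import Counter
from math import sqrt

sentences = [
    "the judge struck the gavel to silence the court".split(),
    "buying the cheap saw is false economy".split(),
    "the nail was driven in when the hammer struck it hard".split(),
]

def context_vector(target, window=2):
    vec = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == target:  # count neighbours within the +/-2 window
                vec.update(sent[max(0, i - window):i] + sent[i + 1:i + window + 1])
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norms = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0

print(cosine(context_vector('gavel'), context_vector('hammer')))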

Implementing Naive Bayes text categorization but I keep getting zeros


python,algorithm,nlp,text-classification,naivebayes
I am using Naive Bayes for text categorization; this is how I created the initial weights for each term in the specified category: term1: number of times term 1 exists / number of documents in categoryA term2: number of times term 2 exists / number of documents in categoryA term3: number of times term 3 exists / number of...
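
For what it's worth, the two classic causes of all-zero Naive Bayes scores are unseen terms (a zero probability wipes out the whole product) and floating-point underflow from multiplying many small numbers. A sketch of the standard fixes, add-one smoothing plus log probabilities (class priors omitted for brevity):

import math
from collections import Counter

def train(docs_by_class):
    # docs_by_class: {label: [token list, ...]}
    vocab = {t for docs in docs_by_class.values() for doc in docs for t in doc}
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(t for doc in docs for t in doc)
        total = sum(counts.values())
        # Laplace smoothing: no term ever gets probability zero.
        model[label] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return model

def classify(model, tokens):
    # Sum logs instead of multiplying raw probabilities to avoid underflow.
    return max(model, key=lambda label: sum(
        math.log(model[label].get(t, 1e-9)) for t in tokens))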

Abbreviation Reference for NLTK Parts of Speech


python,nlp,nltk
I'm using nltk to find the parts of speech for each word in a sentence. It returns abbreviations that I can neither fully intuit nor find good documentation for. Running: import nltk sample = "There is no spoon." tokenized_words = nltk.word_tokenize(sample) tagged_words = nltk.pos_tag(tokenized_words) print tagged_words Returns: [('There', 'EX'),...
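
The tags come from the Penn Treebank tagset, and NLTK can print their definitions itself once the 'tagsets' resource is downloaded:

import nltk
# nltk.download('tagsets')  # the tag documentation is a downloadable resource

nltk.help.upenn_tagset('EX')    # EX: existential "there"
nltk.help.upenn_tagset('NN.*')  # regexes work: all the noun tags
nltk.help.upenn_tagset()        # no argument prints the whole tagset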


Amazon Machine Learning for sentiment analysis


amazon-web-services,machine-learning,nlp,sentiment-analysis
How flexible is the Amazon Machine Learning platform for sentiment analysis and text analytics?

How Can I Access the Brown Corpus in Java (aka outside of NLTK)


java,nlp,nltk,corpus,tagged-corpus
I'm trying to write a program that makes use of natural language parts-of-speech in Java. I've been searching on Google and haven't found the entire Brown Corpus (or another corpus of tagged words). I keep finding NLTK information, which I'm not interested in. I want to be able to load...

Are word-vector orientations universal?


nlp,word2vec
I have recently been experimenting with Word2Vec and I noticed whilst trawling through forums that a lot of other people are also creating their own vectors from their own databases. This has made me curious as to how vectors look across databases and whether vectors take a universal orientation? I...

Identify prepositions and individual POS


nlp,stanford-nlp
I am trying to find the correct part of speech for each word in a paragraph. I am using the Stanford POS Tagger. However, I am stuck at a point. I want to identify prepositions in the paragraph. The Penn Treebank tagset says: IN Preposition or subordinating conjunction. How can I be sure...

NLP - Word Representations


machine-learning,nlp,artificial-intelligence
I am working on a word representation algorithm, similar to Word2Vec and GloVe. I have been asked to make it more dynamic, such that new words could be added to the vocabulary, and new documents could be submitted to the program even after the representations (vectors) have been created. The problem is,...

Separately tokenizing and pos-tagging with CoreNLP


java,nlp,stanford-nlp
I'm having a few problems with the way Stanford CoreNLP divides text into sentences, namely: It treats ! and ? (exclamation and question marks) inside quoted text as a sentence end where it shouldn't, e.g.: He shouted "Alice! Alice!" - here it treats the ! after the first Alice as...

Annotator dependencies: UIMA Type Capabilities?


java,annotations,nlp,uima,dkpro-core
In my UIMA application, I have some annotators that must run after a certain annotator has run. At first, I thought about aggregating these annotators together, but I have other annotators that are also dependent on this (and other) annotator, which makes aggregating hard and/or impractical. I read about Type...

Fast shell command to remove stop words in a text file


shell,nlp,text-processing
I have a 2GB text file. I am trying to remove frequently occurring English stop words from this file. I have stopwords.txt containing words like this: a an the for and I What is a fast method to do this using shell commands such as tr, sed or awk? ...
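
A streaming approach (sketched here in Python rather than shell, to keep all examples on this page in one language) that never loads the 2GB file into memory:

with open('stopwords.txt') as f:
    stopwords = set(f.read().lower().split())

# Stream line by line so the 2GB file never sits in memory at once.
with open('input.txt') as src, open('filtered.txt', 'w') as dst:
    for line in src:
        kept = [w for w in line.split() if w.lower() not in stopwords]
        dst.write(' '.join(kept) + '\n')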

How to not split English into separate letters in the Stanford Chinese Parser


python,nlp,stanford-nlp,segment,chinese-locale
I am using the Stanford Segmenter at http://nlp.stanford.edu/software/segmenter.shtml in Python. Whenever the Chinese segmenter encounters an English word, it will split the word into many characters one by one, but I want to keep the characters together after the segmentation is done. For example: 你好abc我好 currently will become...

POS of WSJ in CoNLL format from the Penn Treebank


nlp
I've got the Penn Treebank CD. How can I convert the designated WSJ documents to CoNLL format, given that the original format is a tree structure? E.g. the CoNLL 2000 shared task (http://www.cnts.ua.ac.be/conll2000/chunking/) uses the treebank. How was this format obtained? Thank you!

Questions about CACM collection


search-engine,information-retrieval,data-retrieval
I'm using the CACM document collection. I tried to find more information on this collection online, but unfortunately I didn't find what I was looking for. If I've understood correctly, this collection contains documents from a paper journal. Given that, I don't understand why every document always...

Coreference resolution using Stanford CoreNLP


java,nlp,stanford-nlp
I am new to the Stanford CoreNLP toolkit and am trying to use it for a project to resolve coreferences in news texts. In order to use the Stanford CoreNLP coreference system, we would usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition and parsing....

Stanford Parser - Factored model and PCFG


parsing,nlp,stanford-nlp,sentiment-analysis,text-analysis
What is the difference between the factored and PCFG models of the Stanford parser, in terms of how they work theoretically and mathematically?

Create Dictionary from Penn Treebank Corpus sample from NLTK?


python,dictionary,nlp,nltk,corpus
I know that the Treebank corpus is already tagged, but unlike the Brown corpus, I can't figure out how to get a dictionary of tags. For instance, >>> from nltk.corpus import brown >>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words()) This doesn't work on the Treebank corpus?...
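
For what it's worth, the same ConditionalFreqDist recipe does seem to work on the bundled Treebank sample, provided nltk itself is imported and the corpus is downloaded; a sketch:

import nltk
from nltk.corpus import treebank
# nltk.download('treebank')  # the WSJ sample shipped with NLTK

wordcounts = nltk.ConditionalFreqDist(treebank.tagged_words())
print(wordcounts['the'].most_common())  # tag frequencies for "the"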

How to interpret scikit-learn's confusion matrix and classification report?


machine-learning,nlp,scikit-learn,svm,confusion-matrix
I have a sentiment analysis task, and for it I'm using this corpus; the opinions have 5 classes (very neg, neg, neu, pos, very pos), from 1 to 5. So I do the classification as follows: from sklearn.feature_extraction.text import TfidfVectorizer import numpy as np tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2,2)) from sklearn.cross_validation...
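
As a reading aid: in scikit-learn's confusion matrix, row i is the true class and column j the predicted class, so the diagonal holds the correct predictions; classification_report then summarizes per-class precision, recall and F1. A toy sketch with the five sentiment labels:

from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y_pred = [1, 2, 3, 4, 4, 1, 3, 3, 4, 5]

# cell [i][j]: samples of true class i predicted as class j
print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5]))
print(classification_report(y_true, y_pred))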

Can anyone help me make the search bar work as I now have the JS prompt? [on hold]


javascript,html5,search,youtube-api,search-engine
I have created a small program that pulls from the YouTube API which allows you to search for a random video for whatever title you enter when prompted. My goal is to have this work like a search engine. I would like to make my search bar the input instead...

NLTK getting dependencies from raw text


python-2.7,nlp,nltk
I need to get dependencies for sentences from raw text using NLTK. As far as I understand, the Stanford parser allows us just to create a tree, but I couldn't find out how to get dependencies for sentences from this tree (maybe it's possible, maybe not), so I've started using MaltParser. Here is...

NLP shift-reduce parser is throwing a NullPointerException during sentiment calculation


nlp,stanford-nlp,sentiment-analysis,shift-reduce
I am trying to find sentiment using NLP; the version I am using is 3.4.1. I have some junk data to process, and it takes around 45 seconds to process using the default PCFG file. Here is an example: String text = "Nm n n 4 n n bkj nun4hmnun Onn...

Natural Language Search (user intent search)


nlp,search-engine,keyword,voice-recognition,naturallyspeaking
I'm trying to build a search engine that allows my users to search with natural language commands, just like Google Now. Except, my search engine is slightly more constrained, in that it is mainly going to be used within an e-commerce site, and allow the users to search for certain...

term frequency over time: how to plot 200+ graphs in one plot with Python/pandas/matplotlib?


python,matplotlib,plot,nlp
I am conducting a textual content analysis of several web blogs, and am now focusing on finding emerging trends. In order to do so for one blog, I coded a multi-step process: looping over all the posts, finding the top 5 keywords in each post, adding them to a list, if...
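
One way to sketch this: put all series into a single DataFrame (one column per keyword) and let pandas draw every column at once with the legend suppressed, plus an overall mean on top. Synthetic data stands in for the real term frequencies:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range('2015-01-01', periods=50, freq='W')
df = pd.DataFrame(np.random.poisson(3, size=(50, 200)), index=dates)

fig, ax = plt.subplots(figsize=(12, 6))
df.plot(ax=ax, legend=False, color='steelblue', alpha=0.2)  # 200 faint lines
df.mean(axis=1).plot(ax=ax, color='red', linewidth=2, label='mean')
ax.legend()
plt.show()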

How to use serialized CRFClassifier with StanfordCoreNLP prop 'ner'


java,nlp,stanford-nlp
I'm using the StanfordCoreNLP API interface to programmatically do some basic NLP. I need to train a model on my own corpus, but I'd like to use the StanfordCoreNLP interface to do it, because it handles a lot of the dry mechanics behind the scenes and I don't need much...

extracting n-grams from huge text


python,performance,nlp,bigdata,text-processing
For example, we have the following text: "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..." I need all possible sections of this text: one word by one...
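
A generator keeps memory flat regardless of file size; a sketch that yields n-grams from a token stream (the file name and n are placeholders):

from collections import deque

def token_stream(path):
    with open(path) as f:       # stream tokens, never the whole file
        for line in f:
            yield from line.split()

def ngrams(tokens, n):
    window = deque(maxlen=n)    # sliding window over the stream
    for tok in tokens:
        window.append(tok)
        if len(window) == n:
            yield tuple(window)

# e.g. all bigrams of a large file:
# for gram in ngrams(token_stream('big.txt'), 2):
#     ...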

How to extract derivation rules from a bracketed parse tree?


java,parsing,recursion,nlp,pseudocode
I have a lot of parse trees like this: ( S ( NP-SBJ ( PRP I ) ) ( @S ( VP ( VBP have ) ( NP ( DT a ) ( @NP ( NN savings ) ( NN account ) ) ) ) ( . . ) )...
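
If Python is an option, NLTK does exactly this: Tree.fromstring parses the bracketed string and productions() lists every derivation rule, including the lexical ones. A sketch on the tree above (assuming the @-prefixed labels are binarization artifacts, as reconstructed there):

from nltk import Tree

s = "(S (NP-SBJ (PRP I)) (@S (VP (VBP have) (NP (DT a) (@NP (NN savings) (NN account)))) (. .)))"
tree = Tree.fromstring(s)
for rule in tree.productions():
    print(rule)
# S -> NP-SBJ @S
# NP-SBJ -> PRP
# PRP -> 'I'
# ...down to NN -> 'account' and . -> '.'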

Python NLTK pos_tag not returning the correct part-of-speech tag


python,machine-learning,nlp,nltk,pos-tagger
Having this: text = word_tokenize("The quick brown fox jumps over the lazy dog") And running: nltk.pos_tag(text) I get: [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')] This is incorrect. The tags for quick brown lazy in the sentence should be:...

Search box/field design with multiple search locations


python,search,design,search-engine,pyramid
Not sure if this question is better suited for a different StackExchange site, but here goes: I have a search page that searches a number of different types of things, all (at the moment) requiring a different input field for each type of search. For example, one might search for...

stemming words in python


python,nlp,stemming
I'm using this code to stem words. Here is how it works: first there's a list of suffixes; the program checks whether the word ends with a suffix from the list and, if so, removes it. However, when I run the code I get this result:...
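
A common bug in hand-rolled stemmers is checking short suffixes before longer ones (so "-s" fires before "-ness") or stripping repeatedly. A sketch that checks longest-first and strips at most once, plus the off-the-shelf alternative:

suffixes = ['ing', 'ness', 'es', 'ed', 's']

def strip_suffix(word):
    for suf in sorted(suffixes, key=len, reverse=True):       # longest first
        if word.endswith(suf) and len(word) - len(suf) >= 3:  # keep a stem
            return word[:-len(suf)]  # strip at most one suffix
    return word

print(strip_suffix('walking'))    # walk
print(strip_suffix('happiness'))  # happi

# The battle-tested alternative:
# from nltk.stem import PorterStemmer
# print(PorterStemmer().stem('running'))  # run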

How can the NamedEntityTag be used as EntityMention in RelationMention in the RelationExtractor?


nlp,stanford-nlp
I'm trying to train my own NamedEntityRecognizer and RelationExtractor. I've managed the NER model, but the integration with the RelationExtractor is a bit tricky. I get the right NamedEntityTags, but the RelationMentions found are only single-term and carry no NamedEntity beyond the default ones. I got input...

Chinese sentence segmenter with Stanford coreNLP


java,nlp,tokenize,stanford-nlp
I'm using the Stanford CoreNLP system with the following command: java -cp stanford-corenlp-3.5.2.jar:stanford-chinese-corenlp-2015-04-20-models.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt And this is working great on small Chinese texts. However, I need to train an MT system which just requires me to segment my input. So I just need...