Euclidean vs Cosine for text data

text,data-mining,information-retrieval,euclidean-distance,cosine-similarity
If I use a tf-idf feature representation (or just document length normalization), are Euclidean distance and (1 - cosine similarity) basically the same? All the textbooks I have read, as well as other forums and discussions, say cosine similarity works better for text... I wrote some basic code to test this and found...
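
A minimal sketch of the relationship being asked about, assuming the vectors are L2-normalized (other length normalizations, such as dividing by the token count, do not give the same result). For unit vectors, the squared Euclidean distance equals 2 * (1 - cosine similarity), so both measures rank neighbors identically. The tf-idf weights below are made-up illustration values, not data from the question.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical tf-idf weights for two documents over a 4-term vocabulary.
doc_a = [0.0, 1.2, 0.7, 3.1]
doc_b = [0.5, 0.9, 0.0, 2.4]

a_hat = l2_normalize(doc_a)
b_hat = l2_normalize(doc_b)

squared_euclidean = euclidean_distance(a_hat, b_hat) ** 2
cosine_distance = 1.0 - cosine_similarity(doc_a, doc_b)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
print(squared_euclidean, 2.0 * cosine_distance)
```

The two printed numbers agree up to floating-point error. On the raw, unnormalized vectors the identity does not hold, because Euclidean distance is then also sensitive to vector length.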

How to get certain information out of arraylist grouped into other lists in Java

java,arraylist,text-files,group,information-retrieval
I wrote a program that reads multiple (similar) text files from a folder. I'm splitting the information by space and storing everything in one ArrayList, which contains data like this: key1=hello key2=good key3=1234 ... key15=repetition key1=morning key2=night key3=5678 ... Now I'm looking for a way to get that information...

How does trec_eval calculate Mean Average Precision (MAP)?

search-engine,information-retrieval,data-retrieval
I'm using trec_eval to evaluate a search engine. I'd like to know how it calculates the Mean Average Precision (MAP). I'm sure it doesn't calculate a simple average of the average precisions (AP). It seems to be a weighted arithmetic mean, but I can't understand which weights are used.
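
For reference, a minimal sketch of the textbook definition of MAP, not trec_eval's actual source code. It assumes binary relevance judgments and rankings without duplicate documents; the function names, query ids, and document ids are hypothetical. Note that AP divides by the total number of relevant documents for the query, including those never retrieved, which may be one reason a hand computation disagrees with the tool's output.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """AP: sum of precision at each relevant hit, divided by ALL relevant docs."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels):
    """Unweighted arithmetic mean of AP over the judged queries."""
    ap_values = [average_precision(runs.get(q, []), qrels[q]) for q in qrels]
    return sum(ap_values) / len(ap_values)

# Hypothetical rankings and relevance judgments for two queries.
runs = {
    "q1": ["d1", "d2", "d3", "d4"],
    "q2": ["d5", "d6", "d7"],
}
qrels = {
    "q1": {"d1", "d3"},
    "q2": {"d6", "d9"},  # d9 is relevant but never retrieved
}

ap_q1 = average_precision(runs["q1"], qrels["q1"])  # (1/1 + 2/3) / 2
ap_q2 = average_precision(runs["q2"], qrels["q2"])  # (1/2) / 2
map_score = mean_average_precision(runs, qrels)
print(ap_q1, ap_q2, map_score)
```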

Keep non-stemmed tokens on Elasticsearch

elasticsearch,information-retrieval
I'm using a stemmer (for the Brazilian Portuguese language) when I index documents on Elasticsearch. This is what my default analyzer looks like (never mind minor mistakes here, because I copied this by hand from my code on the server): "analysis":{ "filter":{ "my_asciifolding": { "type":"asciifolding", "preserve_original":true, }, "stop_pt":{ "type": "stop", "ignore_case": true,...

Calculating tf-idf for web pages

information-retrieval,tf-idf
I am new to IR and I would like to calculate tf-idf for webpages. For the "tf" part, I want to calculate the frequency of each word in the content of one webpage. For the "idf" part, I want to compare the content of multiple webpages. Is there a tool/API that...
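
The two parts described above can be sketched in plain Python. This is one common tf-idf variant (relative term frequency times log(N / df)), not the only one, and it assumes the visible text has already been extracted from each page's HTML. The tokenizer, function names, and sample pages are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # tf = relative frequency in this page, idf = log(N / df)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical page contents, already stripped of markup.
pages = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
weights = tf_idf(pages)
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

Terms that occur in only one page ("cat", "mat") receive a higher weight than equally frequent terms shared across pages ("sat", "on"), and a term that occurs in every page would receive a weight of zero under this variant.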

How to define a CAS in database as external resource for an annotator in uimaFIT?

nlp,data-mining,information-retrieval,uima
I am trying to structure my data processing pipeline using uimaFIT as follows: [annotatorA] => [Consumer to dump annotatorA's annotations from CAS into DB] [annotatorB (should take on annotatorA's annotations from DB as input)] => [Consumer for annotatorB] The driver code: /* Step 0: Create a reader */ CollectionReader readerInstance= CollectionReaderFactory.createCollectionReader(...

Questions about CACM collection

search-engine,information-retrieval,data-retrieval
I'm using the CACM document collection. I tried to search for more information on this collection online, but unfortunately I didn't find what I was looking for. If I've understood correctly, this collection contains documents from a paper journal. As far as this is concerned, I don't understand why every document always...