Tika PackageParser does not work with directories

I am writing a class to recursively extract files from inside a zip file and produce them to a Kafka queue for further processing. My intent is to be able to extract files from multiple levels of nested zip files. The code below is my implementation of the Tika ContainerExtractor to do...

Run external program in Ruby with IO.popen: rescue not working

I'm using the Tika jar to extract metadata from Microsoft Word doc files, but when Tika encounters a problem my rescue does not catch the error; instead the script exits. I'm on Windows 7 with MRI Ruby 1.9.3. I could adapt the doc file but I want to...

Talend iterate on tTikaExtractor

I'm trying to use the tTikaExtractor component to extract the content of several files in a folder. It works with a single file, but when I add a tFileList component, I don't understand how to get the content of the 2 different files. I think it is something related to...

Extract text content from Tika without specifying the file header

Is there a way to extract content from a file with a Tika server without explicitly defining the header? For example, for a file named "file.pdf", if I run curl -X PUT --data-binary @file.pdf localhost:9998/tika --header "Content-type: application/pdf" > file.txt, I get the extracted content in "file.txt", but if...

How to use Tika via PHP when both installed on one server?

I need to make an internal website which allows users to upload .doc, .pdf, .xls files and see the text in a textarea box. I have created the site in PHP to the point where a user can upload the files. I have installed Tika on my server and...

Can't get correct Key-Value Pairs with Tika

I'm trying to get the metadata values from an Office document, and the only key-value pair it shows is this one: Content-Type: application/zip. I just can't tell what the issue is. Why does it only show the Content-Type? What I'm interested in are keys like title. import java.io.FileInputStream; import...

Solr 5.1.0 - Apache TikaEntityProcessor Cannot Find My Files

Solr, more specifically Tika, is having problems finding my file, whose filepath is retrieved from a database. Whenever I go to index, it logs errors saying that it can't find the file. I'm basically doing what this guy is doing here, which is taking a file path from a...

Calling a RESTEasy Client Proxy interface, how can I specify which Content-type the endpoint will consume?

I want to PUT, via binary, to an endpoint that can consume one of many possible mimetypes. Specifically, I am communicating with an Apache Tika server, which could take, say, a PDF or a Word .docx file. I've set up a client proxy interface that I can hardcode, say, the...

Integrate Apache Tika and Solr Cell with Solr to index PDF and Word documents

I am doing a POC to index PDF and Word documents using the Solr search engine. I searched for detailed information or articles but did not find any detailed article on how to do it. What I found is to use an example provided with the Solr package. That is not I...

Using Tika in Java: how to resolve 'java.io.IOException'

I am using Tika with Java for a crawling program. I have used BSF_Recursive for that. After some results, it shows me this... http://www.google.com Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://translate.google.com/ at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source) ...

Why does the Tika facade choose EmptyParser?

I'm using the Tika facade, per the example of the elasticsearch-mapper-attachment plugin. Here's my test code: Tika tika = new Tika(); Metadata md = new Metadata(); try { String content = tika.parseToString(src, md, 100000); System.out.println("Content length: " + content.length()); for (String s: md.names()) { System.out.println(s + ": " + md.get(s));...

How to deploy tika-server-1.7.jar on Tomcat

How do I deploy tika-server as a WAR file under the Tomcat servlet container? I would prefer to deploy without using Maven.