How to strip HTML content in flume morphline.conf file using Xquery


Tag: solr,cloudera,flume

We are trying to index sample XML files into Cloudera Solr using the Flume MorphlineSolrSink.

We have created two channels (solrchannel, hdfschannel) and two sinks (solrsink, hdfssink). With this Flume and morphline configuration we are able to index the documents in Cloudera Solr.

Question 1: The XML file has two fields, title and content, and we want to strip the HTML markup from both fields before sending them to Solr. How can we achieve this?

Question 2: I have to change the date format of two fields, createDate and PublishedDate. How can I write the logic to change the date format of both fields in one go?

I am using XQuery to extract the dates from my XML files.



I found the following solution to my problem and wanted to share it:

2) After the xquery command block, I added the following commands to convert the dates into the required format, and they worked fine.

    # Convert createDate from either input layout to the output layout below.
    convertTimestamp {
      field : createDate
      inputFormats : ["E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"]
      inputTimezone : UTC
      outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      outputTimezone : America/Los_Angeles
    }

    # Same conversion for publishedDate; each command handles a single field.
    convertTimestamp {
      field : publishedDate
      inputFormats : ["E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"]
      inputTimezone : UTC
      outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      outputTimezone : America/Los_Angeles
    }
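
To see what these settings do to a value, here is a minimal standalone Java sketch that mimics the conversion with `SimpleDateFormat`, on the assumption that convertTimestamp interprets its patterns the same way. The class `TimestampConverter` and its `convert` method are hypothetical names used only for illustration; they are not part of the morphline API.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper that mirrors the convertTimestamp settings shown above.
final class TimestampConverter {
    private static final String[] INPUT_FORMATS = {
        "E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"
    };
    private static final String OUTPUT_FORMAT = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";

    // Tries each input pattern in order and formats the first successful parse.
    static String convert(String value, String inputZone, String outputZone) {
        for (String pattern : INPUT_FORMATS) {
            SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ENGLISH);
            // Used only when the value itself carries no zone (e.g. "yyyy-MM-dd").
            in.setTimeZone(TimeZone.getTimeZone(inputZone));
            try {
                Date parsed = in.parse(value);
                SimpleDateFormat out = new SimpleDateFormat(OUTPUT_FORMAT, Locale.ENGLISH);
                out.setTimeZone(TimeZone.getTimeZone(outputZone));
                return out.format(parsed);
            } catch (ParseException e) {
                // This pattern did not match; fall through to the next one.
            }
        }
        throw new IllegalArgumentException("Unparseable timestamp: " + value);
    }

    public static void main(String[] args) {
        // 12:25:11 UTC is 05:25:11 in Los Angeles on this date (PDT, UTC-7).
        System.out.println(convert("Sun Jun 14 12:25:11 UTC 2015", "UTC", "America/Los_Angeles"));
        System.out.println(convert("2015-06-14", "UTC", "UTC"));
    }
}
```

Note that the quoted `'Z'` in the output pattern is literal text, so it is appended whatever the output timezone is. If the target Solr field expects UTC timestamps, setting outputTimezone to UTC may be the safer choice.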

1) To strip the HTML tags from title and content, we wrote Java code and plugged it into our pipeline so that it runs before the file content is sent to Flume.
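
The Java code itself is not shown here, so the following is only a rough sketch of what such a pre-processing step might look like, not the actual implementation. The class `HtmlStripper` and its `stripHtml` method are hypothetical names; the approach is plain regular expressions from the JDK, which is adequate for simple markup but is not a full HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper: removes HTML markup from a field value.
final class HtmlStripper {
    // Drop <script>/<style> elements together with their contents.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripHtml(String input) {
        if (input == null) {
            return null;
        }
        String text = SCRIPT_OR_STYLE.matcher(input).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a handful of common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&#39;", "'")
                   .replace("&amp;", "&");
        // Collapse the whitespace left behind by removed tags.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<p>Hello <b>Solr</b> &amp; Flume</p>"));
    }
}
```

Applied to the title and content values before the events reach Flume, this leaves only the text for Solr to index. For malformed or deeply nested HTML, a real HTML parser would be more robust than regular expressions.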

Hope this helps you as well!


Jayesh Bhoyar

