
Solr DIH (Data Import Handler) – alternative


From Solr 9 onwards, the Data Import Handler is no longer available under the Solr umbrella (SOLR-14066 and SOLR-14783). This removal is part of the effort to make Solr more secure (SOLR-13442), and we completely understand the direction taken by the Solr community.

However, this leaves the current users of the Solr DIH (Data Import Handler) with a number of questions, so now is a good moment to evaluate the direction to take for the coming years.

The quickest and probably easiest way forward for existing users is to move towards the migrated third-party plugin available at https://github.com/rohitbemax/dataimporthandler. However, our guess is that quite a few users have already run into issues with its functionality: the Data Import Handler has limitations that drove users towards other ETL tools to handle the data import task outside of Solr.

The main advantage of the DIH is that it is quite simple to implement. Having said that, it still requires technical resources to implement, and those technical people are often not impressed by ‘simple’ solutions; the DIH shows clear opportunities for improvement:

  • DIH lacks visibility into the different stages of the data import process: gathering data / mapping to Solr / inserting into Solr
  • DIH lacks possibilities to tune the process (throttling, thread pool size)
  • DIH is not ‘SolrCloud’ aware (the process runs within one node of the cloud)
  • the DIH screen in the Solr Admin UI has some errors

This made us switch to a solution that is not part of Solr: a tool that integrates with Solr but can be extended to talk to lots of different data sources on the other end. So we moved to a platform dedicated to integrating systems. Running Camel (https://camel.apache.org/) in a Karaf container (https://karaf.apache.org/) provides us with a flexible solution that allows us to design, configure, monitor and tune the data import process for our search engines.
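To make this concrete, here is a minimal sketch of what such a route can look like: a DIH-style full import that reads rows through the Camel SQL component and indexes them into Solr with SolrJ. This is an illustration under assumptions, not our actual implementation: it presumes a JDBC DataSource is registered for the SQL component, a Solr instance at http://localhost:8983 with a "products" collection, and example table and field names.

```java
// A minimal full-import sketch: SQL component in, SolrJ out.
// All URLs, table and field names below are illustrative assumptions.
import java.util.Map;

import org.apache.camel.builder.RouteBuilder;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ProductImportRoute extends RouteBuilder {

    private final Http2SolrClient solr =
            new Http2SolrClient.Builder("http://localhost:8983/solr/products").build();

    @Override
    public void configure() {
        // Fire once at startup; in practice the trigger would be a shell
        // command, a REST call or a cron-style timer.
        from("timer:full-import?repeatCount=1")
            .to("sql:select id, name, price from products") // body: List<Map<String, Object>>
            .split(body())                                   // one exchange per row
                .process(exchange -> {
                    Map<?, ?> row = exchange.getIn().getBody(Map.class);
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.setField("id", row.get("id"));
                    doc.setField("name", row.get("name"));
                    doc.setField("price", row.get("price"));
                    solr.add(doc);
                })
            .end()
            .process(exchange -> solr.commit()); // single commit after all rows
    }
}
```

A real implementation would batch the adds, page through the result set and add throttling and error handling; the point is that each of those concerns becomes a visible, tunable step in the route instead of being hidden inside the DIH.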

The following features will be implemented in the default implementation of our ‘Search Companion’ for Solr:

  • full and delta imports, persisting the data import state in ZooKeeper or on file
  • full support for SolrCloud and stand-alone Solr implementations
  • data retrieval via JDBC with connection pooling and paging through the result set
  • handling collection alias roll-over (full-import in the background and roll-over of the collection alias when completed; see the sketch after this list)
  • configurable processing parameters per collection (security, field mapping, …)
  • UI to follow up on the details of the process
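The alias roll-over mentioned above boils down to a single SolrJ call once the background full-import has finished. A hedged sketch using the SolrJ 8.x CloudSolrClient API, with a hypothetical ZooKeeper host and collection names:

```java
// Sketch of the alias roll-over step: the full-import has run in the
// background into "products_v2"; re-pointing the alias switches readers
// atomically. Host and collection names are examples.
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AliasRollover {

    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper so the request is SolrCloud-aware.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                List.of("localhost:2181"), Optional.empty()).build()) {
            CollectionAdminRequest.createAlias("products", "products_v2").process(client);
        }
    }
}
```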

On top of this, the Apache Karaf/Camel platform allows us to implement a lot of other processes that are required when working seriously with a search engine:

  • launch bulk Solr queries to evaluate the impact of changes across implementations/versions in terms of performance and relevance
  • controlled upload and activation of specific configuration files in the Solr implementation, based on a folder watcher, to allow business users to manage some Solr config files (sketched below)
  • publish REST services with a Swagger UI
  • integrate with Prometheus and Grafana
  • SSH shell access to manage the services
  • configure JDBC data sources (incl. solr-jdbc) and query them from the shell
  • alerting options upon errors (Slack, mail, …)
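As an illustration of the folder-watcher item above, a route along these lines can pick up configset zips dropped by business users and push them through Solr’s v1 Configset API (/solr/admin/configs?action=UPLOAD). The folder path and Solr URL are assumptions, and the overwrite parameter requires Solr 8.7+:

```java
// Sketch of a folder watcher that uploads configset zips to Solr.
// Paths and URLs are examples, not our actual setup.
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

public class ConfigsetUploadRoute extends RouteBuilder {

    @Override
    public void configure() {
        // Poll the watched folder; readLock=changed waits until the
        // zip is no longer being written before picking it up.
        from("file:/var/companion/config-inbox?include=.*\\.zip&readLock=changed")
            // POST the zip body to /solr/admin/configs?action=UPLOAD&name=<configset>;
            // the configset name is derived from the file name.
            .setHeader(Exchange.HTTP_METHOD, constant("POST"))
            .setHeader(Exchange.HTTP_QUERY,
                    simple("action=UPLOAD&overwrite=true&name=${file:onlyname.noext}"))
            .setHeader(Exchange.CONTENT_TYPE, constant("application/octet-stream"))
            .to("http://localhost:8983/solr/admin/configs")
            .log("Uploaded configset from ${header.CamelFileName}: ${body}");
    }
}
```

In a real deployment this route would also validate the zip and trigger a collection RELOAD, but the skeleton shows how little plumbing the platform needs.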

This approach is – besides being very useful and flexible – also future-proof: we are building a solution within a tool/framework designed for integration with other systems:

  • new source of data required? re-use the process, as Camel supports all mainstream data sources and data formats
  • new export required? switch from/to another search engine: an existing process can easily be redirected to other systems following standard Enterprise Integration Patterns (see the sketch below)
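In Camel terms that redirection is often just the Multicast EIP. The endpoint names below are hypothetical placeholders; the shape of the route is the point:

```java
// One normalized document stream, two (or more) targets: adding or
// swapping a search engine means changing endpoints, not import logic.
import org.apache.camel.builder.RouteBuilder;

public class FanOutRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("direct:normalized-docs")
            .multicast()
                .to("direct:index-into-solr", "direct:index-into-elasticsearch")
            .end();
    }
}
```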

No less important: this toolset is based on our beloved Java language.

We have been following this approach for several years now and feel very confident that it will remain the preferred one for many years to come.

And, as we believe this approach could be valuable for a lot of Solr (and other) users, search-solutions is now working on cleaning up and documenting this ‘Search Companion’ for Solr. When the first draft is ready, we’ll publish these artifacts as open source (under the Apache License), and hopefully the community will give this approach a try!

UPDATE 20/04/2022:
We published a first version of our search companion. See https://www.search-solutions.net/companion/

For any questions, feedback or remarks, feel free to contact us.