From Solr 9 onwards, the Solr Data Import Handler (DIH) will no longer be available as functionality provided under the Solr umbrella. This removal is part of the effort to make Solr more secure, and we completely understand the direction taken by the Solr community.
However, this leaves the current users of the Solr Data Import Handler with a bunch of question marks. It is at least a good point in time to evaluate the direction going forward.
The quickest and probably easiest way for existing users is to move to the migrated third-party plugin available at https://github.com/rohitbemax/dataimporthandler. However, my guess is that quite a few existing users have already experienced some issues with this piece of Solr functionality: the data import handler has limitations that drove users towards other ETL tools to handle the data import task outside of Solr.
The main advantage of the DIH is that it is quite simple to implement. Having said that, it still requires technical resources to implement. Those technical people often are not impressed by ‘simple’ solutions and the DIH shows opportunities for improvement:
- DIH lacks visibility into the different elements of the data import process: gathering data / mapping to Solr / inserting into Solr
- DIH lacks possibilities to tune the process (throttling, thread pool size)
- DIH is not ‘SolrCloud’ aware (the process runs within one node of the cloud)
This made us switch to solutions based on a platform that is dedicated to integrating systems. Running Camel (https://camel.apache.org/) in a Karaf container (https://karaf.apache.org/) provides us with a flexible solution that allows us to design, configure, monitor and tune the data import process for our search engines.
The default implementation will provide the following features:
- full and delta imports, persisting the data import state in ZooKeeper or in a file, depending on whether SolrCloud is used
- data retrieval via JDBC with connection pooling and record paging
- handling collection alias roll-over (full-import in background and roll-over collection alias when completed)
- configurable processing parameters per collection (security, field mapping, …)
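To make the delta import and paging items above a bit more concrete, here is a minimal sketch in plain Java of the file-based variant (the non-SolrCloud case). It is not the code we plan to publish: the class name `ImportStateStore`, the property key, and the `updated_at` and `id` columns are all assumptions made for illustration, and the `LIMIT`/`OFFSET` syntax depends on the database.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.Properties;

/** Hypothetical sketch: keeps the start time of the last successful import in a properties file. */
final class ImportStateStore {
    // Assumed property key, not taken from any existing implementation.
    private static final String KEY = "last.import";
    private final Path file;

    ImportStateStore(Path file) {
        this.file = file;
    }

    /** Returns the last recorded import time, or Instant.EPOCH when no state exists (i.e. do a full import). */
    Instant lastImport() throws IOException {
        if (!Files.exists(file)) {
            return Instant.EPOCH;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        String value = props.getProperty(KEY);
        return value == null ? Instant.EPOCH : Instant.parse(value);
    }

    /** Records the start time of an import that completed successfully. */
    void markImported(Instant startedAt) throws IOException {
        Properties props = new Properties();
        props.setProperty(KEY, startedAt.toString());
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "data import state");
        }
    }

    /** Builds the SQL for one page of a delta query; the timestamp is bound as a JDBC parameter. */
    static String pageQuery(String table, int pageSize, int page) {
        long offset = (long) page * pageSize;
        return "SELECT * FROM " + table + " WHERE updated_at > ? ORDER BY id"
                + " LIMIT " + pageSize + " OFFSET " + offset;
    }
}
```

A delta import would read `lastImport()`, run `pageQuery(...)` page by page until a page comes back empty, and call `markImported(...)` only after the last batch has been committed, so that a failed run is simply retried from the previous state.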
On top of this, the Apache Karaf/Camel platform allows us to implement a lot of other processes that are required when working seriously with a search engine:
- launch bulk Solr queries to evaluate the impact of changes across implementations in terms of performance and relevance
- controlled upload and activation of specific configuration files in the Solr implementation
- publish REST services with Swagger UI
- integrate with Prometheus and Grafana
- SSH shell access to the services
- configure JDBC datasources (incl. solr-jdbc) and query them from the shell
- alerting options upon errors (Slack, mail, …)
This approach is – besides being very useful and flexible – also future-proof: we are building a solution within a tool/framework designed for integration with other systems:
- new source of data required? re-use process as Camel supports all mainstream data sources and data formats
- new export required? switch from/to another search engine: the existing process can easily be redirected to other systems following standard Enterprise Integration Patterns
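The idea behind both points is that the import process only depends on small contracts for "where records come from" and "where documents go". The sketch below shows that decoupling in plain Java rather than in Camel's own DSL; the class `ImportPipeline` and its three parts are hypothetical names chosen for illustration, and in a real Camel route the endpoints would play the roles of source and sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch: an import process that knows nothing about its concrete source or target. */
final class ImportPipeline {
    private final Supplier<List<Map<String, Object>>> source;                   // e.g. a JDBC query, a file, a queue
    private final Function<Map<String, Object>, Map<String, Object>> mapping;   // record -> search document
    private final Consumer<List<Map<String, Object>>> sink;                     // e.g. Solr or another search engine

    ImportPipeline(Supplier<List<Map<String, Object>>> source,
                   Function<Map<String, Object>, Map<String, Object>> mapping,
                   Consumer<List<Map<String, Object>>> sink) {
        this.source = source;
        this.mapping = mapping;
        this.sink = sink;
    }

    /** Reads all records, maps each one to a document, hands the batch to the sink, and returns its size. */
    int run() {
        List<Map<String, Object>> documents = new ArrayList<>();
        for (Map<String, Object> record : source.get()) {
            documents.add(mapping.apply(record));
        }
        sink.accept(documents);
        return documents.size();
    }
}
```

Swapping the data source or the target search engine then means passing a different `source` or `sink`; the mapping and the surrounding process stay as they are.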
Just as important, this toolset is based on our beloved Java language.
We have been following this approach for several years now, and we feel very confident that it will be the preferred one for many years to come.
And, as we believe this approach could be valuable for a lot of Solr (and other) users, search-solutions is now working on cleaning up and documenting the data import code. When ready, we’ll publish these artifacts as open source (under the Apache License), and hopefully the community will give this approach a try!
For any questions, feedback or remarks, feel free to contact us.