ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud
We have a customer that needs to update a few billion documents in SolrCloud. I know the suggested client is CloudSolrClient, for its load-balancing feature.
As per the docs - CloudSolrClient
SolrJ client class to communicate with SolrCloud. Instances of this class communicate with Zookeeper to discover Solr endpoints for SolrCloud collections, and then use the LBHttpSolrClient to issue requests. This class assumes the id field for your documents is called 'id' - if this is not the case, you must set the right name with setIdField(String)
As per the docs - ConcurrentUpdateSolrClient
ConcurrentUpdateSolrClient buffers all added documents and writes them into open HTTP connections. This class is thread safe. Params from UpdateRequest are converted to http request parameters. When params change between UpdateRequests a new HTTP request is started. Although any SolrClient request can be made with this implementation, it is only recommended to use ConcurrentUpdateSolrClient with /update requests. The class HttpSolrClient is better suited for the query interface.
With ConcurrentUpdateSolrClient I am able to use a queue and a pool of threads, which makes it more attractive than CloudSolrClient, which will use an HttpSolrClient once it gets a set of nodes to do the updates.
I would love to hear a more in-depth discussion of these 2 APIs.
@sdutta in SolrCloud you should be using the CloudSolrClient class. It should take care of everything you mentioned. It gets the active Solr servers from Zookeeper, and when you add a document, it automatically sends it to the server hosting the shard for that id. It also keeps track of whether any Solr server is out of commission and automatically reconfigures itself.
CloudSolrClient solrCloudClient = new CloudSolrClient(zkHosts);
Bosco, CloudSolrClient uses an LBHttpSolrClient (which load balances across the nodes), but I do not see that LBHttpSolrClient is multithreaded. So that raises the question: which has the higher throughput?
You will first have to see where the bottleneck is. Regardless of how much you push to the Solr server, it can only index so many documents at a time. If you feel transport is the main issue, then you can just create a couple of threads, and each thread can have its own solrClient instance.
Secondly, you need to batch all your requests, and you shouldn't commit from the client side. You should configure auto-commit on the Solr server side and let it do the final commit. Between Solr doing the buffering vs. you doing the batching, I am not sure what the difference would be.
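A minimal sketch of the client-side batching described above, with no client-side commit. The `sink` parameter is a hypothetical stand-in for whatever sends one batch to Solr (in SolrJ this role would be played by something like `SolrClient.add(Collection<SolrInputDocument>)`), and the batch size of 1000 is only an illustrative assumption, not a recommendation from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchingSketch {
    /**
     * Splits docs into batches of at most batchSize and hands each batch to sink.
     * Returns the number of batches sent. No commit is issued here; that is
     * left to server-side auto-commit.
     */
    static <T> int sendInBatches(List<T> docs, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int sent = 0;
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            // Copy the slice so the sink owns an independent list.
            sink.accept(new ArrayList<>(docs.subList(start, end)));
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add(i);
        }
        // Hypothetical sink: a real one would send the batch to Solr.
        int batches = sendInBatches(docs, 1000,
                batch -> System.out.println("sending batch of " + batch.size()));
        System.out.println(batches + " batches sent");
    }
}
```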
Throwing my 2 cents in since I've spent an insane amount of time working with Solr on this exact problem.
ConcurrentUpdateSolrClient is really easy to get going and you can get a high throughput just by increasing the number of threads. However, at some point it just won't be scalable or efficient once you have a bunch of Solr nodes.
If you are using Solr Cloud, then the CloudSolrClient is definitely the recommended way to go but, in my experience, it is much, much harder to get high throughput. Batching documents is pretty much a requirement. You can't really just increase the number of threads because each one opens a connection to Zookeeper.
If you decide to go with CloudSolrClient, take a look at the code in storm-solr.
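One way to add threads without each one opening its own Zookeeper connection is to share a single client across a fixed pool of workers, each sending batches from its own slice of the input. The sketch below assumes the client is safe to share across threads; `sharedClient` is a hypothetical stand-in for the call that sends one batch, and the thread and batch counts are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class ParallelIndexerSketch {
    /** Sends docs in batches from `threads` workers; returns the number of docs sent. */
    static <T> int indexInParallel(List<T> docs, int threads, int batchSize,
                                   Consumer<List<T>> sharedClient) {
        if (threads <= 0 || batchSize <= 0) {
            throw new IllegalArgumentException("threads and batchSize must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            int sliceSize = (docs.size() + threads - 1) / threads; // ceiling division
            for (int t = 0; t < threads; t++) {
                int from = Math.min(t * sliceSize, docs.size());
                int to = Math.min(from + sliceSize, docs.size());
                List<T> slice = docs.subList(from, to); // each worker gets a disjoint section
                results.add(pool.submit(() -> {
                    int sent = 0;
                    for (int i = 0; i < slice.size(); i += batchSize) {
                        int end = Math.min(i + batchSize, slice.size());
                        List<T> batch = new ArrayList<>(slice.subList(i, end));
                        sharedClient.accept(batch); // one request per batch
                        sent += batch.size();
                    }
                    return sent;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get(); // wait for every worker to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this beats one client per thread depends on where the bottleneck actually is, so it is worth measuring both.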
I posted this to the Solr community and got the answer below from a committer:
It's usually not all that difficult to write a multi-threaded client that uses CloudSolrClient, or even fire up multiple instances of the SolrJ client (assuming they can work
on discrete sections of the documents you need to index).
That avoids the problem Shawn alludes to. Plus other
issues. If you do not use CloudSolrClient, then all the
docs go to some node in the system that then sub-divides
the list (and you really should update in batches, see:
then the node that receives the packet sub-divides it
into groups based on what shard they should be part of
and forwards them to the leaders for that shard, very
significantly increasing the numbers of conversations
being carried on between Solr nodes. Times the number
of threads you're specifying with CUSC (I really regret
the renaming from ConcurrentUpdateSolrServer, I liked
writing CUSS).
With CloudSolrClient, you can scale nearly linearly with
the number of shards. Not so with CUSC.
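To illustrate the routing point above: a shard-aware client can group a batch by destination shard before sending, so each shard leader receives one request directly instead of a receiving node splitting the batch and forwarding the pieces. The sketch below is a simplified illustration only. The modulo-of-hashCode routing is an assumption made for brevity and is not Solr's actual routing function, and `shardFor` and `groupByShard` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShardRoutingSketch {
    /** Stand-in routing: maps a document id to a shard index in [0, numShards). */
    static int shardFor(String id, int numShards) {
        return Math.floorMod(id.hashCode(), numShards);
    }

    /** Groups ids by target shard so each group can go straight to that shard's leader. */
    static Map<Integer, List<String>> groupByShard(List<String> ids, int numShards) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String id : ids) {
            groups.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>()).add(id);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            ids.add("doc" + i);
        }
        // With client-side grouping, one batch becomes at most numShards direct requests.
        Map<Integer, List<String>> groups = groupByShard(ids, 4);
        for (Map.Entry<Integer, List<String>> e : groups.entrySet()) {
            System.out.println("shard " + e.getKey() + ": " + e.getValue().size() + " docs");
        }
    }
}
```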