# solr-ids-export-plugin

**Repository Path**: mirrors_allegro/solr-ids-export-plugin

## Basic Information

- **Project Name**: solr-ids-export-plugin
- **License**: Apache-2.0
- **Default Branch**: master
- **Created**: 2020-09-24
- **Last Updated**: 2026-05-23

## README

# IdsExportPlugin

A set of plugins for Solr that allows time-efficient export of the Ids of all found documents (or the values of any DocValues-enabled field) in comma-separated format, without sorting. Skipping result sorting gives significantly better performance than Solr's built-in `/export` endpoint.

Note: the plugin is developed and tested on a standalone Solr instance, with no promises or guarantees about Solr Cloud.

## Requirements

* Solr version > 7.2 (tested with 7.2.1)
* Solr running in standalone mode (Solr Cloud not supported)

## Motivation

The initial motivation for creating this plugin was the ability to produce output that could be used directly as input for the [Terms Query Parser](http://yonik.com/solr-terms-query/) in another Solr request. Example:

> First, search the Car Brands index and give me the IDs of all brands that sell in Poland
>
> http://localhost:8080/car_brands/select?q=availability:pl&fq={!ids field=brand_id}&wt=ids
>
> Output: vw,opel,audi
>
> Then, search the Car Models index and give me models with an electric engine:
>
> http://localhost:8080/car_models/select?q=engine:electric&fq={!terms f=brand_id}vw,opel,audi

Other possible use cases include:

* simplifying batch jobs that do some calculation on a full result set and don't require any document order (e.g. _recalculate popularity for all products from Poland every day_) - removes the need for paging
* creating reports - finding all documents matching given criteria
* replacing the `/export` endpoint when sorting is not required

## Basic concepts

IdsExportPlugin consists of:

* `IdsExportFilter`
* `IdsExportSearchComponent`
* `IdsExportResponseWriter`

The idea behind IdsExportPlugin is to use a post-filter (`IdsExportFilter`) as the last filter during the request processing phase, which collects all found document Ids in an optimized data structure. `IdsExportSearchComponent` then writes those Ids to the response, and `IdsExportResponseWriter` outputs them in comma-separated format.

### `IdsExportFilter`

`IdsExportFilter` is a Solr post-filter. In Solr terminology, a filter is a piece of code that decides whether a document matches the search criteria and should be included in the response. A post-filter is executed after the regular filters, so it works on a limited set of documents that the previous filters have already narrowed down.

`IdsExportFilter` implements the post-filter interface, but it doesn't actually decide whether a document matches the search criteria: it accepts all documents. Instead, it collects the values of a certain field from the documents and stores them in a data structure. The field name is defined in the request [URL or configuration](#configuration).

This filter was initially designed to read the values of the documents' unique key, but it can in fact read the values of any field that has [DocValues](https://lucene.apache.org/solr/guide/7_2/docvalues.html) enabled.
**In this document we will refer to those values as `Ids`.**

Internally, Ids are stored in a data structure that depends on the field's DocValues type:

* for fields with Numeric or Sorted Numeric DocValues, Ids (which are longs) are stored in a [`com.carrotsearch.hppc.LongArrayList`](http://carrotsearch.github.io/hppc/releases/0.7.3/api/com/carrotsearch/hppc/LongArrayList.html) (a data structure based on an array of primitive longs)
* for fields with Binary, Sorted or SortedSet DocValues, Ids (which are Strings) are stored as an ArrayList of [`org.apache.lucene.util.BytesRef`](https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/util/BytesRef.html) (a Lucene-optimized type for binary values, mainly used for Strings)

Ids don't need to be unique: a value that occurs several times is stored once for each occurrence.

### `IdsExportSearchComponent`

`IdsExportSearchComponent` is a search component (a piece of code which executes after request processing, but before the response is sent) which simply adds the collected Ids to the Solr response under the key defined in the [configuration](#configuration). After this operation, the response contains an additional list with the Ids of all found documents.

### `IdsExportResponseWriter`

The last component, `IdsExportResponseWriter`, transforms the Solr response into a comma-separated list of Ids. All other response elements are skipped. The MIME type of the response is set to `text/plain` and the encoding to `UTF-8`.

Note: using `IdsExportResponseWriter` is optional. If you don't need the comma-separated format and are fine with the standard Solr JSON/XML/etc. response, you don't have to use `IdsExportResponseWriter`.

## Installation

1. Add the JAR file to Solr's classpath ([https://lucene.apache.org/solr/guide/7_2/lib-directives-in-solrconfig.html](https://lucene.apache.org/solr/guide/7_2/lib-directives-in-solrconfig.html))
2. Add the following code to `solrconfig.xml` (a sketch: `PACKAGE` stands in for the plugin's Java package, and the element layout assumes standard Solr plugin registration):

   ```xml
   <!-- Assumption: PACKAGE is a placeholder, not the real package name. -->
   <queryParser name="ids" class="PACKAGE.IdsExportFilterParserPlugin">
     <int name="bufferInitialSize">100000</int>
   </queryParser>
   <searchComponent name="ids" class="PACKAGE.IdsExportSearchComponent">
     <str name="responseKey">ids</str>
   </searchComponent>
   <queryResponseWriter name="ids" class="PACKAGE.IdsExportResponseWriter">
     <str name="responseKey">ids</str>
   </queryResponseWriter>
   ```

The exact meaning of the configuration parameters is described in [Configuration](#configuration).

Each of those components may be registered under any valid name.

* The name of the `IdsExportFilterParserPlugin` (which is a factory for `IdsExportFilter`) will be reflected in the Solr URL (you will use it in requests to activate the plugin), so give it a reasonable name. **In this document we will assume the name `ids`**

      http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}

* We strongly recommend giving `IdsExportSearchComponent` the same name as `IdsExportFilterParserPlugin`, for simplicity.
* The name of the `IdsExportResponseWriter` will be reflected in the Solr URL (you will use it to change the output format), so give it a reasonable name. We recommend the same name as for `IdsExportFilterParserPlugin`, for simplicity. **In this document we will assume the name `ids`**

      http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&wt=ids

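
The two-step flow from the Motivation section can also be scripted on the client side. The Python sketch below only builds the request URLs and parses the plain-text body; it does not contact Solr. The helper names (`ids_export_url`, `parse_ids`, `terms_filter`) are hypothetical, and the sketch assumes that the query parser and the response writer are both registered as `ids`, as assumed throughout this document.

```python
from urllib.parse import urlencode


def ids_export_url(core_url, query, field, name="ids"):
    """Build a /select URL that activates the post-filter and the Ids writer.

    `name` is the name under which the query parser and the response writer
    are registered (assumed here to be the same for both).
    """
    params = {
        "q": query,
        "fq": "{!%s field=%s}" % (name, field),  # activates IdsExportFilter
        "rows": 0,   # regular documents are not needed, only the Ids
        "wt": name,  # selects IdsExportResponseWriter
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def parse_ids(body, separator=","):
    """Split the text/plain response body into a list of Ids."""
    body = body.strip()
    return body.split(separator) if body else []


def terms_filter(field, ids):
    """Build an fq value for the Terms Query Parser (comma-separated terms)."""
    return "{!terms f=%s}%s" % (field, ",".join(ids))


# Step 1: URL for exporting brand Ids from the car_brands core.
first_url = ids_export_url("http://localhost:8080/car_brands", "availability:pl", "brand_id")

# Step 2: turn the body of the first response into a filter for car_models.
brand_ids = parse_ids("vw,opel,audi")
second_fq = terms_filter("brand_id", brand_ids)
```

`urlencode` percent-encodes the braces, the exclamation mark and the space in the `fq` value, which avoids problems with HTTP clients that reject those characters in a raw URL.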
## Usage examples

```
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&wt=ids

# Outputs the comma-separated values of the `product_id` field from all documents in the index. Example response:

1,2,3,4,5,6
```

```
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&rows=2

# Outputs the list of values of the `product_id` field as an additional attribute of the Solr response. Example response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "*:*",
      "fq": "{!ids field=product_id}",
      "rows": "2"
    }
  },
  "response": {
    "numFound": 6,
    "start": 0,
    "docs": [
      {
        "product_name": "Test 0",
        "product_id": "0"
      },
      {
        "product_name": "Test 1",
        "product_id": "1"
      }
    ]
  },
  "ids": ["0", "1", "2", "3", "4", "5"]
}

# Note: ids doesn't respect the rows/start parameters - it always outputs everything found.
```

## Configuration

`IdsExportFilterParserPlugin` configuration options available in `solrconfig.xml`:

* `bufferInitialSize` - initial size (in number of items) of the buffer for storing Ids. It should be a bit bigger than the estimated average response size. Generally any number will work, however:
  - if set too low, the buffer will be extended several times during request processing, resulting in increased CPU and memory consumption
  - if set too high, you will allocate a lot of memory unnecessarily

  Default value: *100 000*.
* `defaultIndexField` - name of the field where Ids are stored. This can also be configured on a per-request basis via the URL parameter `field`; if the URL parameter is missing, the default configured here is used. Default value: *doc_id*.

`IdsExportFilter` configuration options available in URL:

* `field` - name of the field where Ids are stored. Default value: configured in `defaultIndexField` in `solrconfig.xml`

      http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}

`IdsExportSearchComponent` configuration options available in `solrconfig.xml`:

* `responseKey` - key in the Solr response where Ids should be stored. Default value: *ids*.

`IdsExportResponseWriter` configuration options available in `solrconfig.xml`:

* `responseKey` - key in the Solr response where Ids are stored. The final Solr output will contain only the separated values from this field. Default value: *ids*.
* `separator` - a separator (char or String) used to separate values in the final Solr output. **In this document we assume it is a comma**, which is why the phrase "comma-separated" is used throughout, but it can be changed. Default value: *,* (comma)

## Performance

### Single query time comparison

In this test scenario, a single Solr instance processed only one request at a time. Each request was sent to Solr three times:

1. To the `/select` endpoint, with `rows=0` and `IdsExportPlugin` enabled
2. To the `/export` endpoint, with sorting set to `Ids` (sorting was obligatory)
3. To the `/select` endpoint, with `rows` set to the expected size of the result set and sorting set to `Ids`, without `IdsExportPlugin`

Note: the given times are total request times, including sending the HTTP request, searching, and downloading the HTTP response. Technically, times were measured with the Linux `time` command, which measured the execution time of `curl` with a given query. This approach is not a "clean" benchmark of the plugin itself: it also includes the overhead of downloading a potentially large response, which favors IdsExportPlugin thanks to its very concise output format. It is, however, the closest to the actual use cases of the plugin.
Results (times in seconds):

| numFound | IdsExportPlugin | /export | /select |
|----------|-----------------|---------|---------|
| 2 | 0.036 | 0.012 | 0.008 |
| 1082 | 0.012 | 0.128 | 0.136 |
| 12957 | 0.02 | 1.956 | 1.949 |
| 225816 | 0.149 | 55.105 | 59.068 |
| 1841320 | 0.681 | 393.532 | 396.918 |
| 5971685 | 2.232 | 831.853 | 822.736 |

### Multi-threaded performance

In this test scenario, a single Solr instance processed requests arriving concurrently over multiple connections. Each request was sent to two endpoints:

1. To the `/select` endpoint, with `rows=0` and `IdsExportPlugin` enabled
2. To the `/export` endpoint, with sorting set to `Ids` (sorting was obligatory)

The test scenario was divided into three test cases. Each test case used a set of unique phrases, selected to give the expected number of results:

1. between 20 000 and 50 000 (phrases giving "small" result sets)
2. between 50 000 and 280 000 (phrases giving "medium sized" result sets)
3. between 280 000 and 3 100 000 (phrases giving "large" result sets)

This test scenario was carried out using Apache JMeter. All results presented below come from JMeter.
Results:

| concurrent connections | requests per connection | total request count | result set size per request | IdsExportPlugin RPS | IdsExportPlugin avg | IdsExportPlugin max | /export RPS | /export avg | /export max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 30 | 80 | 2400 | 20000-50000 | 489.50 rps | 47.00 ms | 190.00 ms | 3.00 rps | 9414.00 ms | 26972.00 ms |
| 30 | 80 | 2400 | 50000-280000 | 199.10 rps | 127.00 ms | 325.00 ms | 0.80 rps | 35663.00 ms | 126313.00 ms |
| 30 | 22 | 660 | 280000-3100000 | 30.00 rps | 796.00 ms | 2669.00 ms | 0.10 rps | 230305.00 ms | 812294.00 ms |

### Performance - summary

The presented results clearly show that `IdsExportPlugin` greatly speeds up Ids export from Solr: response time and throughput may be a **couple of hundred times better** than with Solr's built-in `/export` or `/select` endpoints (for example, 2.232 s versus 831.853 s with `/export` for 5 971 685 results in the single-query test). The biggest performance killer for `/export` and `/select` is result set sorting. `IdsExportPlugin` does not perform any sorting; it just outputs all found Ids in the order in which Solr processes them.

### Memory consumption

Memory consumption of `IdsExportPlugin` is not higher than that of the standard `/export` endpoint. On the one hand, `IdsExportPlugin` requires a data structure whose size is proportional to the number of found documents, so the bigger the result set, the more memory is needed for processing. On the other hand, the standard `/export` endpoint also requires a data structure with a size proportional to the result set size, for sorting purposes. Therefore the overall memory footprint of `IdsExportPlugin` will not be higher than that of `/export`.
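
As a rough illustration of that proportionality for numeric Ids: a Java primitive `long` takes 8 bytes, so the backing array needs at least 8 bytes per slot. The Python sketch below is a lower-bound estimate only. The function name is hypothetical, and the estimate ignores spare capacity left over after the buffer grows, JVM object overhead, and the String-based (`BytesRef`) case.

```python
BYTES_PER_LONG = 8  # size of a Java primitive long


def min_long_buffer_bytes(num_found, buffer_initial_size=100_000):
    """Lower bound on the size of the primitive long array holding the Ids.

    The buffer starts with `buffer_initial_size` slots (default taken from the
    Configuration section) and must hold at least `num_found` values.
    """
    slots = max(num_found, buffer_initial_size)
    return slots * BYTES_PER_LONG


# Result set sizes taken from the single-query results table.
for num_found in (2, 225_816, 5_971_685):
    mib = min_long_buffer_bytes(num_found) / (1024 * 1024)
    print(f"{num_found:>9} Ids -> at least {mib:.1f} MiB")
```

For small result sets the estimate is dominated by `bufferInitialSize`, which is why setting that option far above the typical result size wastes memory on every request.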
**Pro tip:** Generally it's best to use `IdsExportPlugin` with fields that have DocValues of type `Numeric` or `SortedNumeric`. In that case the data structure is [`com.carrotsearch.hppc.LongArrayList`](http://carrotsearch.github.io/hppc/releases/0.7.3/api/com/carrotsearch/hppc/LongArrayList.html), which internally relies on an array of primitive longs. All other field types store their Ids in an ArrayList of [`org.apache.lucene.util.BytesRef`](https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/util/BytesRef.html), an optimized way of storing Strings.

## Build

`./gradlew clean build`

## License

This software is published under [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0).