Indexing Data in Solr 1.4 Enterprise Search Server: Part 1
Ahren Stevens-Taylor
Let’s get started.
Communicating with Solr
There are a few dimensions to the options available for communicating with Solr:
Direct HTTP or a convenient client API
Applications interact with Solr over HTTP. This can be done either directly, using an HTTP client of your choice, or through a Solr integration API such as SolrJ or Solr Flare, which in turn uses HTTP.
An exception to HTTP is offered by SolrJ, which can optionally be used in an embedded fashion with Solr (so-called Embedded Solr) to avoid network and inter-process communication altogether. However, unless you are sure you really want to embed Solr within another application, this option is discouraged in favor of writing a custom Solr updating request handler.
Data streamed remotely or from Solr’s Filesystem
Even though an application will be communicating with Solr over HTTP, it does not have to send Solr the data over this channel. Solr supports what it calls remote streaming. Instead of being given the data directly, Solr is given a URL that it will resolve. It might be an HTTP URL, but more likely it is a filesystem-based URL, applicable when the data is already on Solr’s machine. Finally, in the case of Solr’s DataImportHandler, the data can be fetched from a database.
The following are the different data formats:
- Solr-XML: Solr has a specific XML schema it uses to specify documents and their fields. It supports instructions to delete documents and to perform optimizes and commits too.
- Solr-binary: Analogous to Solr-XML, it is an efficient binary representation of the same structure. This is only supported by the SolrJ client API.
- CSV: CSV is a character-separated value format (the separator is often a comma).
- Rich documents like PDF, XLS, DOC, and PPT: The text data extracted from these formats is directed to a particular field in your Solr schema.
- Finally, Solr’s DataImportHandler (DIH) contrib add-on is a powerful capability that can communicate with both databases and XML sources (for example, web services). It supports configurable relational and schema mapping options and supports custom transformation additions if needed. The DIH uniquely supports delta updates if the source data has modification dates.
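To make the Solr-XML format a little more concrete, here is a minimal sketch in Python that builds an add message with the standard library. The helper name build_add_message and the field names id and a_name are made up for illustration; the real field names come from your own schema.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Return a Solr-XML <add> message for a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # Each field becomes <field name="...">value</field>
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document: these field names and values are illustrative only.
message = build_add_message([{"id": "Artist:1", "a_name": "Example Artist"}])
print(message)
```

The resulting string is what would be sent to Solr as the body of an update request.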
We’ll use the XML, CSV, and DIH options in bringing the MusicBrainz data into Solr from its database to demonstrate Solr’s capability. Most likely, an application would use just one format.
Before these approaches are described, we’ll discuss curl and remote streaming, which are foundational topics.
Using curl to interact with Solr
Solr receives commands (and possibly the associated data) through HTTP POST.
Solr lets you use HTTP GET too (for example, through your web browser). However, GET is an inappropriate HTTP verb for a request that changes something on the server, as indexing does. For more information on this concept, read about REST.
One way to send an HTTP POST is through the Unix command line program curl (also available on Windows through Cygwin). Even if you don’t use curl, it is very important to know how we’re going to use it, because the concepts will be applied no matter how you make the HTTP messages.
There are several ways to tell Solr to index data, and all of them are through HTTP POST:
- Send the data as the entire POST payload (only applicable to Solr’s XML format). curl does this with --data-binary (or a similar option) and an appropriate content-type header reflecting that it’s XML.
- Send some name-value pairs akin to an HTML form submission. With curl, such pairs are preceded by -F. If you’re giving data to Solr to be indexed (as opposed to it looking for it in a database), then there are a few ways to do that:
- Put the data into the stream.body parameter. If it’s small, perhaps less than a megabyte, then this approach is fine. The limit is configured with the multipartUploadLimitInKB setting in solrconfig.xml.
- Refer to the data through either a local file on the Solr server using the stream.file parameter or a URL that Solr will fetch it from through the stream.url parameter. These choices are a feature that Solr calls remote streaming.
Here is an example of the first choice. Let’s say we have an XML file named artists.xml in the current directory. We can post it to Solr using the following command line:
curl http://localhost:8983/solr/update -H 'Content-type:text/xml;charset=utf-8' --data-binary @artists.xml
If it succeeds, then you’ll have output that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int><int name="QTime">128</int>
</lst>
</response>
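The same exchange can be made from any HTTP client. As a rough sketch (not the only way to do it), the Python below builds a POST whose entire body is the XML and reads the status out of a response shaped like the one above. The function names are hypothetical; the URL and content type are the ones from the curl command.

```python
import urllib.request
import xml.etree.ElementTree as ET

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # same URL as the curl example


def build_xml_post(xml_bytes, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST whose entire body is Solr-XML."""
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def response_status(response_xml):
    """Return the integer status from a Solr XML response (0 on success)."""
    root = ET.fromstring(response_xml)
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    return int(status.text)


def post_file(path):
    """Send the file at `path` to Solr and return the response status."""
    with open(path, "rb") as f:
        request = build_xml_post(f.read())
    with urllib.request.urlopen(request) as response:
        return response_status(response.read())
```

With a Solr instance running locally, post_file("artists.xml") would correspond to the curl command shown earlier.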
To use the stream.body feature for the example above, you would do this:
curl http://localhost:8983/solr/update -F stream.body=@artists.xml
In both cases, the @ character instructs curl to get the data from the file instead of sending @artists.xml literally. If the XML is short, then you can just as easily specify it literally on the command line:
curl http://localhost:8983/solr/update -F stream.body=' <commit />'
Notice the leading space in the value. This was intentional. With -F, curl gives a leading @ or < a special meaning (read the value from a file), which is not what we want here. In this case, it might be more appropriate to use --form-string instead of -F. However, it’s more typing, and I’m feeling lazy.
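Outside of curl, the stream.body approach is simply a form parameter. The sketch below builds such a request in Python without sending it. It is written under one assumption worth stating loudly: it uses URL-encoded form data rather than the multipart encoding that curl -F produces, on the expectation that Solr reads stream.body as a request parameter either way. The function name is hypothetical.

```python
import urllib.parse
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # same URL as the curl examples


def build_form_post(params, url=SOLR_UPDATE_URL):
    """Build (but do not send) a form-style POST of name-value pairs."""
    # Assumption: URL-encoded form data instead of curl's multipart encoding.
    body = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# No leading space is needed here: the special handling of @ and < is curl's.
commit_request = build_form_post({"stream.body": "<commit />"})
```

Passing commit_request to urllib.request.urlopen would send it to a running Solr instance.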
In the examples above, we’ve given Solr the data to index in the HTTP message. Alternatively, the POST request can give Solr a pointer to the data in the form of either a file path accessible to Solr or an HTTP URL to it.
The file path is accessed by the Solr server on its machine, not the client, and the Solr process must have the necessary operating system file permissions to read it.
However, just as before, the originating request does not return a response until Solr has finished processing it. If you’re sending a large CSV file, then it is practical to use remote streaming. Otherwise, if the file is of a decent size or is already at some known URL, then you may find remote streaming faster and/or more convenient, depending on your situation.
Here is an example of Solr accessing a local file:
To use a URL, the parameter would change to stream.url, and we’d specify a URL. We’re passing a name-value parameter (stream.file and the path), not the actual data.
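A minimal sketch of the same idea from Python follows. It only builds the requests, so nothing is sent until you pass one to urllib.request.urlopen. The function name, the path /tmp/artists.xml, and the example.com URL are all hypothetical placeholders, and the sketch assumes Solr accepts these parameters as URL-encoded form data.

```python
import urllib.parse
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # same URL as the curl examples


def build_remote_stream_post(param, location, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST that points Solr at data to fetch itself."""
    if param not in ("stream.file", "stream.url"):
        raise ValueError("param must be 'stream.file' or 'stream.url'")
    # Only the pointer travels in the request, not the data itself.
    body = urllib.parse.urlencode({param: location}).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical locations: the path must exist on the Solr server, not the client.
file_request = build_remote_stream_post("stream.file", "/tmp/artists.xml")
url_request = build_remote_stream_post("stream.url", "http://example.com/artists.xml")
```

Either request is tiny regardless of how large the referenced data is, which is the point of remote streaming.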
Remote streaming must be enabled
In order to use remote streaming (stream.file or stream.url), you must enable it in solrconfig.xml. It is disabled by default and is configured on a line that looks like this:
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
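If you want to confirm the setting before relying on remote streaming, a small script can read it from the configuration file. This is only a sketch: the function name is hypothetical, and it assumes the requestParsers element appears somewhere in solrconfig.xml with the attribute shown above.

```python
import xml.etree.ElementTree as ET


def remote_streaming_enabled(solrconfig_xml):
    """Return True if the first requestParsers element enables remote streaming."""
    root = ET.fromstring(solrconfig_xml)
    # iter() searches the whole tree, so the exact nesting does not matter here.
    parsers = next(root.iter("requestParsers"), None)
    if parsers is None:
        return False
    return parsers.get("enableRemoteStreaming", "false").lower() == "true"


# Usage with a file on disk (the path is an assumption about your setup):
# with open("solr/conf/solrconfig.xml", "rb") as f:
#     print(remote_streaming_enabled(f.read()))
```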