Achieving High Availability in Enterprise Search
As organizations roll out search applications to end-users in order to provide a single point of information access, a funny thing happens - users become reliant on it and expect that search will always be “on” - just like it is on the web. However, many search applications on the market were not designed to address the demands of high availability, leaving customers to develop their own solutions and workarounds to address the problem. Given that search is becoming much more ubiquitous within the enterprise, designing search software with system availability in mind should be a requirement for vendors. Let’s take a look at the challenges of making search reliable and the problems associated with an ad hoc approach.
Enterprise search applications can be divided into two parts - the search front-end and the search back-end. The front-end is the portion of the application that your users will see and interact with. The front-end sends requests (web service, proprietary API, or REST requests) to one or more back-end machines to get the raw search results that will be presented to the user. The back-end machines contain search collections and interact only with the front-end machines, never directly with the user. Making the front-end reliable is much the same as making any other web application reliable. Many articles and books have been written on this subject.
I want to focus on the search specific challenges related to the back-end of your search application. This includes addressing how the front-end should communicate with the back-end. Fundamentally, if you want to have very highly available search, then you cannot have a single point of failure. If your search is installed on a single machine, that is a single point of failure. No matter how robust and reliable you make the software, as soon as the physical hardware breaks, the search application will no longer be working. For that matter, all it takes is someone to pull the wrong network cable in a data center and there goes your search.
To be reliable you simply need to remove all these single points of failure. Sounds easy, right? Clone your back-end machine and put both copies of it behind a load-balancer and call it done. For terminology, the cloning of the machines is called “replicating” the data.
First of all, that just pushes the problem up to the load balancer (your new single point of failure) and the network between your load balancer and your front-end machines. But that’s not the only problem. While the load balancer is a reasonable solution for websites, things get a fair bit more complicated for a search application.
The first challenge is dealing with updates to the underlying data being searched. For example, if you are offering a corporate email search, new emails are arriving constantly and users are deleting and moving emails between folders. The search must be updated to reflect these changes in a timely manner. The two machines could update themselves independently (giving you the redundancy that you want), but this doubles the load on your mail server. Independently updating the machines also has the unfortunate property that the same search can (and likely will) return different search results depending on which machine you are assigned to by the load balancer. Since the load balancer sits between the front-end and back-end, hitting “reload” for the search results will likely make them change.
This is a problem, but you can probably live with slight differences in the search results and the extra load of having two machines crawling your email server. But how would you feel if your enterprise kept growing and you needed more servers to handle all of your users? Would you feel comfortable with five search machines all returning different results, putting 5x the load on your mail server and using 5x as much of your bandwidth?
Not to mention the challenge of keeping the configuration on the redundant machines in sync. Every time you change anything on one back-end machine you need to make sure that change gets applied to all the back-end machines. Manual synchronization like this is extremely error-prone and I can guarantee that it won’t take long before you make a mistake. Automating the process is challenging because you are limited by whatever flexibility is offered by the search engine itself (Is there a config file that you can copy? What makes up the configuration? How can you tell the software that you’ve changed the config manually?). Again, this is an area where you need the search software to be an active participant.
The only reasonable solution that will scale is to have a machine which is responsible for collecting and processing the updated information. Periodically, this updated information is atomically (or as close as possible to atomically) pushed out to all the machines. As part of this pushing process, the configuration from the one master machine needs to be transmitted to the other back-end servers, eliminating the need for manual changes to all the servers.
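To make the "atomic push" idea concrete, here is a minimal sketch of how a replica could switch to a new snapshot in one step. It is an illustration under stated assumptions, not how any particular search product does it: the directory layout, the file names `index.bin` and `config.json`, and the functions `publish_snapshot` and `read_current` are all hypothetical. It relies on the fact that renaming over an existing name is atomic on POSIX filesystems, so a reader sees either the old snapshot or the new one, with data and configuration always matching.

```python
import json
import os


def publish_snapshot(replica_root, version, index_bytes, config):
    """Stage index data and configuration together, then switch atomically.

    All names here (snapshot-N directories, index.bin, config.json) are
    hypothetical; a real search engine defines its own on-disk layout.
    """
    snapshot_dir = os.path.join(replica_root, f"snapshot-{version}")
    os.makedirs(snapshot_dir)
    with open(os.path.join(snapshot_dir, "index.bin"), "wb") as f:
        f.write(index_bytes)
    with open(os.path.join(snapshot_dir, "config.json"), "w") as f:
        json.dump(config, f)

    # Create the new symlink under a temporary name, then rename it over
    # "current". The rename replaces the old link in a single step, so
    # readers never observe a half-updated snapshot.
    tmp_link = os.path.join(replica_root, f"current.tmp-{version}")
    os.symlink(snapshot_dir, tmp_link)
    os.replace(tmp_link, os.path.join(replica_root, "current"))


def read_current(replica_root):
    """Return (index_bytes, config) from whichever snapshot is live."""
    # Resolve the link once so both reads come from the same snapshot,
    # even if a new one is published in between.
    snapshot_dir = os.path.realpath(os.path.join(replica_root, "current"))
    with open(os.path.join(snapshot_dir, "config.json")) as f:
        config = json.load(f)
    with open(os.path.join(snapshot_dir, "index.bin"), "rb") as f:
        return f.read(), config
```

Because the configuration travels inside the same snapshot as the data, the two can never be out of step on a replica. Coordinating the switch across several machines at roughly the same moment is a separate problem that this sketch does not address.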
I’ve seen people try to take care of this themselves by writing scripts and utilities to move the data around between machines. To be honest, I’ve never seen people get this right. There are so many subtle problems that can crop up. What if more data gets indexed while the script is running? What if the script runs too soon? How well are network errors handled? How can you synchronize the distribution? How do you push changes to the data and the configuration if they are not compatible? Given all the issues, this is a feature that should be pushed into the actual search software. Without tight coordination between the replication and the indexing, you will invariably encounter strange problems and glitches in the system.
The second challenge is on the front-end side which must decide where and how to get results from the back-end machine(s). If you decided to use a load balancer on top of the cloned machines, there isn’t much to talk about here. You are constrained to whatever flexibility the load balancer offers. You are probably going to have to set up a special script to be able to get the load balancer to recognize whether or not all the collections on a back-end machine are really in a good state (the load balancer is going to need to decide whether a back-end machine is “up” or “down” where “up” has to mean that each and every collection hosted on the machine is currently able to be searched). Using a load balancer also limits how you can scale the architecture to provide truly reliable search, because all machines need to be physically located close together and at some level must share a common network connection.
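The "special script" for the load balancer might look something like the following sketch. It assumes the load balancer treats an HTTP 200 as “up” and a 503 as “down”, and that you have some cheap way to test a single collection; the `probe` callable and the function name `machine_status` are hypothetical stand-ins for whatever your search engine actually provides.

```python
def machine_status(collections, probe):
    """Return (http_status, failing_collections) for one back-end machine.

    `probe(name)` is a hypothetical callable that runs a cheap test query
    against one collection and returns True if it answered correctly.
    """
    failing = []
    for name in collections:
        try:
            ok = probe(name)
        except Exception:
            # A probe that crashes or times out counts as a failed collection.
            ok = False
        if not ok:
            failing.append(name)
    # "Up" means every hosted collection is searchable. A machine with no
    # collections is treated as "down" so it never receives traffic.
    if collections and not failing:
        return 200, failing
    return 503, failing
```

Note the cost of this all-or-nothing rule: one broken collection takes the whole machine out of rotation, even for queries against collections that are working fine.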
If you’ve left the load balancer behind and are going for the more general solution, then we need to look at how the front-end machines decide where they should send a request for search results. No surprises here: the front-end machines need to periodically poll the back-end machines to determine which are ready and able to handle queries (for a given collection). It’s actually important to make this decision on a per-collection basis because, for whatever reason (including human error), you may have one collection in a bad state while everything else on that back-end machine is functioning correctly. So, the front-end machine polls the back-end machines to find out their state and based on a history of these tests (and an appropriate load balancing policy), the front-end machines can select the appropriate destination for the query. People have written code and scripts to do this on their own.
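As a rough illustration of per-collection availability tracking, the sketch below keeps a short history of poll results for each (back-end, collection) pair and only routes to back-ends whose recent history is entirely good. The class name `AvailabilityTracker`, the "last N polls must all pass" rule, and the round-robin policy are assumptions chosen for simplicity; a real deployment would pick its own thresholds and balancing policy.

```python
from collections import defaultdict, deque
from itertools import count


class AvailabilityTracker:
    """Tracks poll results per (backend, collection) pair. Hypothetical sketch."""

    def __init__(self, backends, history=3):
        self.backends = list(backends)
        self.history = history
        # Each pair keeps only its most recent `history` poll results.
        self._results = defaultdict(lambda: deque(maxlen=history))
        self._counter = count()

    def record(self, backend, collection, ok):
        """Store the outcome of one poll of one collection on one back-end."""
        self._results[(backend, collection)].append(bool(ok))

    def healthy(self, collection):
        """Back-ends whose last `history` polls of this collection all passed."""
        good = []
        for backend in self.backends:
            recent = self._results.get((backend, collection))
            if recent and len(recent) == self.history and all(recent):
                good.append(backend)
        return good

    def choose(self, collection):
        """Round-robin over healthy back-ends; None if there are none."""
        candidates = self.healthy(collection)
        if not candidates:
            return None
        return candidates[next(self._counter) % len(candidates)]
```

With two back-ends where one has a damaged "wiki" collection, `choose("wiki")` keeps returning the good machine while `choose("mail")` still alternates between both, which is exactly what a machine-level “up”/“down” check cannot express.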
But once again, I would say that doing it on your own - outside of the search engine, that is - is a mistake. You have to make sure that the testing and the searching are coordinated. If you thought it was hard to write the scripts to push the data, wait until you try to deal with the search. Conceptually it is easy, but the details can lead to many small and dangerous problems. What happens when a test times out? What happens when the testing stops running? How do you know that the collection is really in good shape? The terrible thing about these types of problems is that they only manifest themselves when something else goes horrendously wrong, like a switch failing, a power outage, etc. And, as you probably know, this is most likely to happen at 4am on a Sunday morning.
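Two of those questions, the timed-out test and the tester that silently stops running, have a common answer: fail closed. The sketch below is one hedged way to express that. A probe that raises (including a timeout) is recorded as a failure, and a result older than `max_age` seconds is ignored even if it was good. The names `ProbeResult`, `run_probe` and `may_route`, and the idea that the probe signals a timeout by raising an exception, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool
    checked_at: float  # seconds, on the same clock as `now` below


def run_probe(probe: Callable[[], bool], now: float) -> ProbeResult:
    """Run one test; any exception, including a timeout, counts as a failure."""
    try:
        ok = bool(probe())
    except Exception:
        ok = False
    return ProbeResult(ok=ok, checked_at=now)


def may_route(result: Optional[ProbeResult], now: float, max_age: float) -> bool:
    """Only route queries on the strength of a recent, successful test."""
    if result is None:
        return False  # never tested
    if now - result.checked_at > max_age:
        return False  # the tester has stopped running; the result is stale
    return result.ok
```

Passing `now` in explicitly, rather than reading the clock inside the functions, keeps the behavior easy to test, which matters for code that otherwise only gets exercised during an outage.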
Assuming that your search software provides the capabilities to replicate collections and make availability decisions on the front-end, what can you do with a real application?
We host a customer’s search application which must be available from around the world and involves a dozen search collections, and this customer really wanted 100% up-time. Making this work was a design and logistical challenge. They use two banks of load-balanced front-end machines in two geographically dispersed data centers which contact back-end machines in other data centers (the front-end and back-end machines couldn’t be combined for logistical reasons). All of the data centers have multiple redundant network connections and the upstream networks themselves are completely separate (a network issue in one data center is unlikely to affect the other).
After working through many little glitches (when I say that making it work correctly is hard, I am speaking from experience) it runs with virtually no down-time. We have had many hardware failures (disks, routers/switches), network failures, power outages with dead UPSes, etc. Most of these problems would have brought down the application if it were located in only a single location using only a load balancing switch. By distributing it geographically it’s possible to survive these problems with no noticeable effect on the application, which is an important lesson to learn. Equally important was how tightly coupled the test and replication mechanisms need to be.
As a last word, there’s actually an even better reason why you should insist that your enterprise search software take care of reliability issues. That’s because you can then insist that it also will deal directly with scalability issues, which I’ll get to in my next posting.