Jerome Pesenti

What’s Wrong with Google’s Enterprise Search Security? (Part 1)

I just stumbled onto a post from December on the Google enterprise search blog arguing that of the two primary methods for implementing document-level search security, ACL indexing (early binding) and search-time result by result checking (late binding), only the latter is truly secure. Sounds to me like a prime example of a vendor trying hard to make a virtue out of a serious product limitation (Google search appliances currently do not support the early-binding method):

“While we agree with Mark [reference to this article] on some of the benefits with using early-binding security filtering, there are certain limitations that make it impractical (if not impossible) to use for most deployments today”.

Contrary to what the Google spokesperson says in the quote above, there is actually broad consensus in the industry that, of the two methods, early binding is in most cases the more practical, and that a combination of both methods should be used for high-requirement deployments.

There are multiple reasons for this:

1. In many instances late binding, which requires a network request for every top result considered, is way too inefficient to be practical. When indexing emails or SharePoint team sites, most people have permission to access just a very small portion of the content. With the late-binding method, in order to return the results that the end user has permission to see (let’s say 10 results), the search application needs to consider and check many more results (possibly hundreds) until it finds 10 that the user is allowed to see. So, in this example, you would have hundreds of network requests for each query, adding tremendous load to the network infrastructure and creating very high latencies, in some cases up to minutes.

The search administrator is then confronted with a dilemma: either set a high timeout and allow searches to run for minutes, or set a low timeout and risk returning partial and inconsistent results. This problem is compounded when using search results clustering or results de-duplication, as both techniques require a higher number of results to be retrieved.

2. The inefficiency described above actually makes the late-binding approach highly insecure. Using a stopwatch, a malicious user can craft probing queries and infer from the response time alone that content matching those queries exists. Sensitive information can be obtained this way, for example by using queries like: “Firing John Doe”, “acquiring company X”, “chemical compound Y”, etc.

3. In addition, late binding often requires passing around the user credentials. In the standard implementation of this method, the user is required to pass their username and password to the search application, which in turn will use them to check each result. This is not only cumbersome (because of the need to re-enter credentials instead of leveraging integrated authentication) but also runs counter to the security policies of many IT departments.
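
Points 1 and 2 above can be made concrete with a small simulation. This is only a sketch under stated assumptions: the function `late_binding_filter`, the in-memory `allowed` set, and the document names are all hypothetical, and the access check here is a local lookup where a real deployment would make one network request per result.

```python
from typing import Callable, Iterable, List, Tuple


def late_binding_filter(
    ranked_hits: Iterable[str],
    can_access: Callable[[str, str], bool],
    user: str,
    page_size: int = 10,
) -> Tuple[List[str], int]:
    """Return the first `page_size` hits the user may see, plus the number of checks made."""
    visible: List[str] = []
    checks = 0
    for doc_id in ranked_hits:
        checks += 1  # in a real deployment, each check would be a network request
        if can_access(user, doc_id):
            visible.append(doc_id)
            if len(visible) == page_size:
                break
    return visible, checks


# Hypothetical corpus: 1000 documents match the query; the user may see every 50th one.
hits = [f"doc-{i}" for i in range(1000)]
allowed = {f"doc-{i}" for i in range(0, 1000, 50)}


def check(user: str, doc_id: str) -> bool:
    # Stand-in for asking the source repository whether `user` can read `doc_id`.
    return doc_id in allowed


def deny_all(user: str, doc_id: str) -> bool:
    # A user with no access to any matching document.
    return False


# Point 1: filling one page of 10 results costs hundreds of checks.
page, checks = late_binding_filter(hits, check, "alice")

# Point 2: both probes return an empty page, but the work done differs,
# which is what a response-time measurement would reveal.
_, probe_checks = late_binding_filter(hits[:300], deny_all, "mallory")
_, empty_checks = late_binding_filter([], deny_all, "mallory")
```

In this toy setup the page of 10 results takes 451 checks, and the two empty-page probes cost 300 checks versus none. The exact numbers are artifacts of the made-up corpus; the point is only that the cost depends on how much hidden content matches the query.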

In contrast, early binding can provide very good security, especially when user groups are gathered at search time. A standard implementation takes the username and query at search time and uses a directory service such as Active Directory to identify which security groups the user belongs to. In this case there is no latency in terms of user permissions (but still some latency in terms of document ACLs). User permissions are often the most important part of the authorization process. For example, if an employee is fired, their access will be revoked immediately. Changes in the document ACLs are rarely time-sensitive, because the affected documents were already accessible in the recent past under the old ACLs.
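
As a rough illustration of the early-binding flow described above, here is a minimal sketch. The `INDEX` and `DIRECTORY` dictionaries, the group names, and the function names are all hypothetical stand-ins: a real deployment would store ACLs in the search index at crawl time and query an actual directory service at search time.

```python
from typing import Dict, List, Set

# Hypothetical index: each entry carries the ACL captured when the document was crawled.
INDEX: Dict[str, dict] = {
    "offer-letter.doc": {"text": "acquiring company X", "acl": {"executives"}},
    "press-release.doc": {"text": "acquiring new customers", "acl": {"everyone"}},
    "roadmap.doc": {"text": "product roadmap", "acl": {"engineering", "executives"}},
}

# Hypothetical directory: a stand-in for a group lookup against a directory service.
DIRECTORY: Dict[str, Set[str]] = {
    "alice": {"everyone", "executives"},
    "bob": {"everyone", "engineering"},
}


def groups_for(user: str) -> Set[str]:
    """Resolve the user's groups at query time, so membership changes apply at once."""
    return DIRECTORY.get(user, set())


def early_binding_search(user: str, term: str) -> List[str]:
    """Return matching documents whose indexed ACL shares a group with the user."""
    groups = groups_for(user)
    return sorted(
        doc_id
        for doc_id, entry in INDEX.items()
        if term in entry["text"] and entry["acl"] & groups
    )


alice_results = early_binding_search("alice", "acquiring")
bob_results = early_binding_search("bob", "acquiring")
```

Because the group lookup happens per query, removing a user from the directory takes effect on the next search with no re-indexing. Only the document-side ACLs in `INDEX` can be stale.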

If these changes in document ACLs do happen to be critical, then the best approach is to combine the two security methods. Very few vendors (Vivisimo is one of them) support this most accurate solution. The advantage of using early binding in combination with late binding is that few documents have to be thrown away (only those whose security has changed in the last X minutes or hours), which guarantees a consistent response time and a truly secure deployment. The ultimate solution is actually to do early binding with a very low synchronization time (in the second or minute range).
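
One way to picture the combined approach is sketched below. This is not a description of any vendor's implementation: the function `hybrid_filter`, the `recently_changed` set, and the sample data are assumptions made for illustration. The idea is that results have already passed the indexed-ACL filter, and a live check is made only for documents whose security changed since the last synchronization.

```python
from typing import Callable, Iterable, List, Set, Tuple


def hybrid_filter(
    candidates: Iterable[str],
    recently_changed: Set[str],
    live_check: Callable[[str, str], bool],
    user: str,
) -> Tuple[List[str], int]:
    """Re-check only recently changed documents; trust the indexed ACL for the rest."""
    visible: List[str] = []
    live_checks = 0
    for doc_id in candidates:
        if doc_id in recently_changed:
            live_checks += 1  # the only results that cost a live request
            if not live_check(user, doc_id):
                continue  # access was revoked after the last sync
        visible.append(doc_id)
    return visible, live_checks


# Hypothetical data: four results passed the early-binding filter,
# and the ACL on "c" changed since the last synchronization.
candidates = ["a", "b", "c", "d"]
recently_changed = {"c"}


def live_check(user: str, doc_id: str) -> bool:
    # Stand-in for asking the source repository; "c" is no longer readable.
    return doc_id != "c"


visible, live_checks = hybrid_filter(candidates, recently_changed, live_check, "alice")
```

Here one live check is enough to drop the revoked document, instead of one check per result. Note that this sketch only catches revocations; a document newly opened to the user would not show up until the next synchronization.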

If you want to read more about security, you can download a technical white paper on it here. (Yes the marketing people make you register to download it, but unless you ask them to contact you, they won’t!)

Stay tuned for part two…

Discussion

  1. googfan wrote:

    What you failed to cover was how someone would actually deploy an early binding method. Early binding puts a massive deployment effort on the part of the consultant or IT department deploying the search system. In our search deployment and content management operation, we often find companies who strive to implement early binding-based deployments, but after weeks of “business analysis” and technical prototyping realize that it’s just not technically practical in their heterogeneous environment. You see, the real issue is that most (if not all) enterprises don’t have one nice ACL store that has the access information for all of their various content systems. If they did, this problem would be a no-brainer. The work of some security vendors like CA is trending toward this, but they’ll be the first to admit that adoption is very slow and rework high in order to implement centralized or homogeneous distributed policy stores.

  2. Jerome Pesenti wrote:

    This is a good point that I am going to address in more detail in Part #2. The short answer is that for the most common repositories, the search vendor should be able to offer connectors handling the security framework(s) properly. Vivisimo Velocity for example provides out-of-the-box connectors to Unix & Windows file systems, Lotus Notes, SharePoint, Documentum, Email servers, Email archives, etc. For each of these systems, the security framework is known and documented and our software takes care of collecting and mapping out the ACLs properly. The fact that they use different frameworks is not an issue. The only part left to the administrator is to figure out which user IDs apply (for example if the Lotus Notes ID is different from the Windows ID), in which case an extra call to a directory service might be required, but that’s also true of “late binding” anyway.

    For unknown or poorly documented repositories (likely to be crawled rather than connected to through an API), “late binding” might be the only practical solution but “early binding” should be used without difficulties with all others (often containing the majority of the enterprise content).

  3. Jan wrote:

    Hogwash.

    1) The answer to performance issues for late binding is ACL caching - and no, it’s different from early binding. With early binding I have no choice as to the validity period of the ACL info - with ACL caching I have ultimate control and can determine on a per-content-set basis how sensitive the ACLs should be. For HR info I may well require a per-query check, on my published white papers I may be happy with the ACLs sticking around for days at a time.

    2) Highly insecure??? Like anything, I suppose late binding could be implemented in a way that leaks information, but come on - most companies struggle with getting the ACLs right on their most sensitive content, never mind this. Besides, the ACL check overhead for most systems is very low - I cannot imagine a user (with a stopwatch no less!) gaining any meaningful information this way. There are way too many variables here - this point is just scaremongering.

    3) I’m not sure where you get the “need to re-enter user credentials” from. Any reasonable implementation will rely on existing user credentials, be it from the portal or from an SSO framework - since you must have a shared authorisation credentials system to make sense of the ACLs in the first place why would you not use it??

  4. Jerome Pesenti wrote:

    1/ “Late binding” is checking if each search result can be accessed with certain credentials - so whatever cache you get is only valid for a given URL and a given user which will be very limited. “Late binding” is often implemented by doing a “HEAD” request to the actual page passing the username & password of the end-user, i.e., it doesn’t involve any ACLs (that’s actually one point of Google’s post: it can be deployed without having to figure out the underlying ACLs).

    2/ My post is a response to Google’s post claiming that “early binding” is not secure (because of latencies in the ACL indexing). My point is that in general it’s more secure than “late binding”. I gather from your first point that we actually somewhat agree here that “caching the ACL” is often not a big deal.

    I disagree with you with regard to the performance of “late binding” (which, again, is not really an ACL check). It can be (and often is) very expensive. In the case of Sharepoint & Email search for example, it often requires doing hundreds of network requests per query (see Mark Bennett’s explanation in the original post).

    3/ Re-entering user credentials is not a strict requirement if you have implemented SSO across all your collections, but it’s very common for many security infrastructures (for example for people using Windows Integrated Authentication, see this post on the search appliance user group).

  5. Shad wrote:

    re: your timing attack on late binding.

    It only tells the user if content matching the query is available.
    This is only a problem if you have a case where very untrusted users
    can query against a higher security search index or a search index
    with multiple security levels.

    It would be very hard to build up a picture of the content of a single
    document using this technique. You don’t have an identifier to work
    off of. You cannot be sure that you are accessing the same document.

    Examples where this would be useful anyway:
    You suspect the government of developing mind reading dachshunds.
    If you searched for mind-reading-dachshunds and it took longer than
    a garbage query, then you had a good case that they exist.

    This timing attack would be quite interesting to try out. The search
    engine can skew results by actually padding out low result searches by
    doing another dummy search for something more common.

    To maintain a higher level of security, if you have low security
    users, they should not be allowed to query a higher security search
    index.

    However….

    If you have mixed security level content and are using a single index
    for it, then you are already saying it’s okay for the user to know
    that content exists. But a low security level user isn’t allowed to
    see the whole thing. Content-for-pay would be a good example of this;
    you have links describing pay-content, but you can’t view it unless
    you pay.

    An interesting version of this would be where each result is actually
    an AJAX call to fill in the details. By using the actual content
    management system’s ability to check the user’s credentials (AJAX is
    client side) you prevent the search engine from having to have extra
    logic to manage credentials (and yet another thing to audit). You reveal that
    a document exists, and even reveal its location. Ideal for a
    content-for-pay system. You advertise without giving away the
    information.
