ACAP: A way to make AJAX search-friendly?
November 15, 2006
Google Search and similar crawlers have a difficult time with AJAX applications: they were tuned for traditional, full-page-load web content and do not adapt well to single-page web applications, where there are often no well-identified URLs for different pieces of content and getting to the content in the first place is trickier than following a few links (did I mention Flash-based apps?). Still, for most of these sites, showing up in search results (i.e. exposure) is essential.
ACAP (Automated Content Access Protocol) is a new initiative from the international publishing community to turn the challenges facing the industry from web technologies (especially search) into opportunities in a win-win way, and as a side effect can help Web Applications out, too.
First a summary of the publishing angle.
Right now major search engines (especially Google) crawl (retrieve), index and store web content without being aware of the special needs of publishers. This has led to well-documented cases in which Google faces charges of massive copyright infringement for its activities. In some cases Google stores a copy of content that is publicly available on the net, and when the publisher later changes its policy for a particular piece of content (going from public to a fee- or membership-based structure), the content is still present in Google’s search results and can be retrieved from there.
It must be underlined that Google (or other search engines) cannot be faulted (at least morally) for doing this since they are indexing millions of web sites automatically and it is impractical to implement special cases for certain specific sites at this scale. On the other hand, publishers do have their moral and legal rights to protect their intellectual property.
A special twist of the situation is that for publishers it is indeed important to be indexed: this helps them to generate major exposure for their content.
This creates a challenging case that seems to be a lose-lose situation: either search engines are constantly sued for copyright infringement (where they can fight back with fair use, etc.), or publishers instruct the search engines not to index their content at all (e.g. with robots.txt), which means lost exposure and thus lost revenue.
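To make the robots.txt option concrete, here is a minimal sketch in Python of how a well-behaved crawler would interpret a publisher's exclusion rules. The rules, the `/archive/` path and the example.com URLs are made up for illustration; only the standard library's `urllib.robotparser` is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a publisher that hides its archive from crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/news/today.html",
                "http://example.com/archive/2005/story.html"):
        print(url, may_fetch(ROBOTS_TXT, "Googlebot", url))
```

The answer for each URL is a plain yes or no. There is no way to say "index this, but show only an excerpt" or "this content becomes paid next month", which is the kind of gap described above.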
An important element of the case is that both parties want to cooperate so if there were a technical means of communicating the intentions/requirements of publishers, it could be converted into a win-win opportunity.
That’s where ACAP comes into the picture. The publishing community has decided to establish a standard way of communicating permissions information that search engine crawlers can adhere to automatically.
Technically it is a challenge to implement such a solution. Although there are only a handful of major search engines, there are literally millions of publications on the web, created and maintained with many different tools and in varying structures. Any such solution therefore has to integrate well with all of these platforms and technologies.
In a well-designed solution, search engines would be able to index even copyrighted material and, during a user’s search session, return only contextual excerpts with proper attribution and, in the case of non-free content, pointers to how to access the full content, thus creating an invaluable means of disseminating such content. This could be augmented with publisher-provided taxonomy and auxiliary information that would help the crawler set the PageRank (or similar) of the particular material.
Now how can this be utilized for Web Applications?
Of special interest are the upcoming Web 2.0 and web application technologies that make it very difficult for crawlers to index content properly. ACAP could be extended to solve this issue, so that search engines would benefit from content that is easier to crawl and has a better signal-to-noise ratio, and publishers would be able to fine-tune their search results.
A trivial way of making AJAX apps search-friendly is to prohibit indexing of the AJAX application itself and direct the crawler to an (otherwise “inaccessible”) area of the site where all the information that the content owner wants indexed is available in a search-friendly form. An interesting twist is to use a URL scheme in which the web server, when it detects that an incoming request does not come from a crawler, redirects the request (based on the URL) into the AJAX application.
This way crawlers would only receive content that is really valuable, they would be more effective with less effort and search results would improve significantly. Content owners would be able to make their AJAX sites searchable the way they like it.
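The redirect scheme described above could be sketched roughly as follows, in Python. Everything named here is an assumption made for illustration: the `/static/` prefix for the search-friendly area, the `/app` entry point, the use of a URL fragment to carry the item into the AJAX application, and the crawler tokens matched against the User-Agent header. User-agent sniffing is only a stand-in for whatever detection a real site (or an extended ACAP) would use.

```python
from typing import Optional, Tuple
from urllib.parse import quote

# All of these values are hypothetical.
STATIC_PREFIX = "/static/"   # search-friendly area meant for crawlers
APP_URL = "/app"             # entry point of the AJAX application
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def is_crawler(user_agent: str) -> bool:
    """Crude crawler detection based on the User-Agent header."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def route(path: str, user_agent: str) -> Tuple[int, Optional[str]]:
    """Decide how to answer a request.

    Returns (status, location): (302, target) when a human visitor asks for
    a search-friendly page and should be sent into the AJAX application,
    otherwise (200, None), meaning "serve the requested page as it is".
    """
    if path.startswith(STATIC_PREFIX) and not is_crawler(user_agent):
        item = path[len(STATIC_PREFIX):]
        # Carry the item identifier into the application as a fragment.
        return 302, APP_URL + "#" + quote(item)
    return 200, None


if __name__ == "__main__":
    print(route("/static/articles/42", "Mozilla/5.0 (compatible; Googlebot/2.1)"))
    print(route("/static/articles/42", "Mozilla/5.0 (Windows) Firefox/2.0"))
```

With this shape, a search result pointing at a `/static/...` page takes a human visitor straight to the matching state of the application, while the crawler only ever sees the plain page.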
Combined with ACAP (for IP protection), and with a probable extension of the protocol so that the searchable content carries “nice permalinks” into the AJAX site, this may be a good solution for search in the Web 2.0 era.
Would be a way cool project to work on the specification/implementation…

Ajaxian » ACAP: Making Ajax Search Friendly | November 17, 2006
[...] Peter Illes has posted on handling search crawlers within an Ajax application. [...]
ACAP: Making Ajax Search Friendly | November 19, 2006
[...] Peter Illes has posted on handling search crawlers within an Ajax application. [...]
Matthew Walker | November 20, 2006
So it’s not RFC 2244 ACAP -- Application Configuration Access Protocol? :)
piprog | November 20, 2006
Mat,
You are right, it is not. There are only a couple of good-sounding four-letter words, so overloading is a daily business nowadays :)
Peter
ACAP: fixing what ain’t broke » Rudd-O | November 21, 2006
[...] The article at piBlog » Blog Archive » ACAP: A way to make AJAX search-friendly? wants to spread the usage of a new protocol on the World Wide Web. This protocol, named ACAP, is supposed to be the cure of modern Web indexing problems. [...]