Tuesday, November 15, 2005

Technical challenges for XML harvesting

Some students and I have been experimenting with Google as an "out of the box" citation harvester while waiting for the Interarchive list to get set up. Here's what our research suggests so far:

* Google doesn't seem to pay much attention to code inside Web pages, even <meta> tags. My guess is that Google and Yahoo can find Creative Commons XML buried in Web pages because of a private agreement between CC and those search engines. We should contact CC to confirm this and find out how hard that road is.

* The much-vaunted Google Search API doesn't offer much more than the normal advanced search options.

* The one hack that remains promising is to slip metadata into a link's anchor text: not the tag attributes, but the text you click on. I'm going to design a snippet of test code for this and ask folks whose Web pages are spidered by Google to add it to their pages (at least temporarily). That should give us a quick proof of concept to assess.
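
To make the anchor-text idea concrete, here is a rough sketch in Python of how such a snippet might be generated. It is only an illustration, not the actual test code: the function name citation_anchor, the "key: value; key: value" layout, the field names, and the example.org URL are all hypothetical, and whether Google indexes link text written this way is exactly what the proof of concept would need to show.

```python
from html import escape


def citation_anchor(url, fields):
    """Build an <a> element whose visible link text carries the metadata.

    Hypothetical helper: the "key: value; key: value" layout is an
    assumption for illustration, not an agreed-upon format.
    """
    # The metadata goes into the clickable text, not into tag attributes.
    text = "; ".join(f"{key}: {value}" for key, value in fields.items())
    # Escape both the URL and the text so the result is valid HTML.
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


# Placeholder citation data, invented for this example.
snippet = citation_anchor(
    "http://example.org/paper",
    {"title": "Sample Work", "author": "A. Author", "year": "2005"},
)
print(snippet)
```

A page owner would paste the printed line into a page that Google already spiders, and we would then search for the metadata string to see whether it turns up.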


