Interarchive: Technical challenges for XML harvesting

Tuesday, November 15, 2005

Some students and I have been experimenting with Google as an "out of the box" citation harvester while waiting for the Interarchive list to get set up. Here's what our research suggests so far:

* Google doesn't seem to pay much attention to code inside Web pages, even <meta> tags. My guess is that Google and Yahoo can find Creative Commons XML buried in Web pages only because of a private agreement between CC and those search engines. We should contact CC to confirm this and find out how hard that road is.

* The much-vaunted Google Search API doesn't offer much more than the normal advanced search options.

* The one hack that remains promising is to slip metadata into a link's anchor text--not the tag attributes, but the text you click on. I'm going to design a snippet of test code for this possible solution and ask folks with Web pages spidered by Google to add such snippets to their pages (at least temporarily). That should give us a quick proof of concept to assess.
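
As a rough illustration of the anchor-text idea, here is a minimal Python sketch. It is not the test snippet itself, just one way such a thing might look. The "key=value; key=value" format, the names build_anchor, AnchorTextCollector and parse_fields, and the example.org URL are all assumptions made up for this example. Nothing here says how Google actually indexes anchor text.

```python
from html import escape
from html.parser import HTMLParser


def build_anchor(url, fields):
    """Return an <a> element whose visible text carries the metadata.

    Hypothetical convention: "key=value" pairs joined by "; ".
    Values containing "; " would break this simple format.
    """
    text = "; ".join(f"{key}={value}" for key, value in fields.items())
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


class AnchorTextCollector(HTMLParser):
    """Collect (href, anchor text) pairs, roughly as a harvester might."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text seen so far inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are ignored.
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks)))
            self._href = None


def parse_fields(anchor_text):
    """Turn "key=value; key=value" anchor text back into a dict."""
    fields = {}
    for pair in anchor_text.split("; "):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    snippet = build_anchor(
        "http://example.org/work",  # placeholder URL
        {"title": "Test Work", "creator": "A. Artist"},
    )
    print(snippet)

    collector = AnchorTextCollector()
    collector.feed("<p>" + snippet + "</p>")
    collector.close()
    for href, text in collector.links:
        print(href, parse_fields(text))
```

The round trip (build, then parse back) only shows that the metadata survives as ordinary link text. Whether a search engine preserves and exposes that text is exactly what the proof of concept would have to establish.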

jon

About Me

Jon Ippolito is an artist, writer and curator born in Berkeley, California in 1962 who turned to making art after failing as an astrophysicist. After applying for what he thought was a position as a museum guard, Jon was hired in the curatorial department of the Guggenheim, New York, where in 1993 he curated Virtual Reality: An Emerging Medium and subsequent exhibitions that explore the intersection of contemporary art and new media. In 2002 Jon joined the faculty of the University of Maine's New Media Department, where with Joline Blais he co-founded Still Water, a lab devoted to studying and building creative networks. His writing on the cultural and aesthetic implications of new media has appeared in The Washington Post, Art Journal and numerous art magazines. Oh, and he doesn't post to Blogger blogs very often. More at three.org/ippolito.
