[Search-l] more than just interoperability
Fred Benenson
fred.benenson at gmail.com
Mon Jun 4 14:59:11 UTC 2007
Instead of publicly crawling the human indexes (del.icio.us / stumbleupon,
etc.) ourselves, why don't we have our users do it? It's not a complete
workaround, but it might be an approach that works for a bit. Here's how I
envision it:
* A client-side crawler (similar to YaCy, but not targeting the entire web,
just metadata-rich places) implemented through a Firefox extension or
Greasemonkey script.
* When a user visits a social network with valuable data (as determined by
a list managed by us), their local client makes a copy of all the data
delivered to their browser.
* The server can't tell the difference between a user surfing with the
extension or without the extension.
* That data is then meta-tagged and packaged properly locally and sent to
the Wikia Search servers from the client's machine.
* The Wikia Search servers then index and make sense of all of this data
culled from the various clients running the Wikia Search client / extension.
This way we're able to work around the bandwidth concerns that Yahoo and
company would have with us crawling their databases. And the data that we're
getting is merely stuff that is being browsed naturally, by live humans, so
it's likely of more value.
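
To make the steps above a bit more concrete, here is a rough sketch of the
"package it locally" step. It's in Python only for readability; the real
client would be a Firefox extension or Greasemonkey script. Everything named
here (TARGET_HOSTS, is_target, package_page, the payload fields) is made up
for illustration, and the actual upload to the Wikia Search servers is left
out because no such interface has been defined yet.

```python
import json
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse

# Hypothetical stand-in for the list of metadata-rich sites "managed by us".
TARGET_HOSTS = {"del.icio.us", "www.stumbleupon.com"}


def is_target(url: str) -> bool:
    """Return True if the page the user is already viewing is on the list."""
    host = urlparse(url).hostname or ""
    return host.lower() in TARGET_HOSTS


def package_page(url: str, items: list, client_version: str = "0.1") -> Optional[str]:
    """Wrap data the browser already received into a JSON payload.

    Returns None for pages that are not on the managed list, so nothing
    about ordinary browsing is ever packaged.
    """
    if not is_target(url):
        return None
    payload = {
        "source_url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "client_version": client_version,  # assumed field, for server-side debugging
        "items": items,  # e.g. links and tags scraped from the rendered page
    }
    return json.dumps(payload, sort_keys=True)
```

The client makes no extra requests to the target site here; it only repackages
what the user's own visit already fetched, which is the whole point of the
approach.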
But bandwidth is obviously not the only thing they're concerned with. As has
been mentioned, these sites view these databases of useful human-tagged
information as enormously valuable assets that give them a competitive edge.
So then it's a question of the "intellectual property" contained in those
databases. Now, I'm not sure if other networks do this, but Del.icio.us and
Flickr have implemented Creative Commons licensing. That means that a
particular user's stream of content that they've created (links, photos,
etc.) can be licensed for other people to share. I think this would be a
perfect opportunity for our distributed crawlers to take advantage of.
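
As a rough illustration of that idea, a client could keep only the records
whose owner has attached a Creative Commons license. The sketch below is
Python for readability, and it assumes each captured record carries a
"license_url" field. That field name is my invention, and how a given site
actually exposes the license would have to be checked per site.

```python
from urllib.parse import urlparse

CC_HOSTS = {"creativecommons.org", "www.creativecommons.org"}


def is_cc_licensed(record: dict) -> bool:
    """True if the record's (assumed) license_url points at a CC license page."""
    parsed = urlparse(record.get("license_url") or "")
    host = (parsed.hostname or "").lower()
    return host in CC_HOSTS and parsed.path.startswith("/licenses/")


def shareable(records: list) -> list:
    """Keep only records the user has explicitly licensed for sharing."""
    return [r for r in records if is_cc_licensed(r)]
```

Anything without a recognizable license is dropped, so the default is to
send nothing.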
Thoughts?
Fred
On 6/3/07, Nitin Borwankar <nitin at borwankar.com> wrote:
>
> jer wrote:
>
> >> So, here it is: Getting data from existent social bookmarking
> >> services may be an option we should consider. Think of it -
> >> aggregating data from del.icio.us <http://del.icio.us>,
> >> stumbleupon, etc. Now, I can't imagine how we'd get Yahoo to
> >> give us the data from del.icio.us <http://del.icio.us>, but maybe
> >> there are other providers who would be willing to do this. Or
> >> perhaps we look at paying them for it, at least enough to cover
> >> their bandwidth and other overhead.
> >>
> >> Anybody got any ideas around this type of thing?
> >>
> >>
> >> Yeah.. it's a good way to find the actual interests of people through
> >> social bookmarking, digg and many other social websites.. But it all
> >> depends on whether they are ready to release their data to such open
> >> source search projects..
> >
> >
> > For the most part, all of those sites and all of that data *is* open,
> > it just needs to be intelligently crawled and indexed. They're great
> > seed sites for keeping a crawler fresh.
> >
> > Sure it would be nice to have it in a more digestible form, but it's
> > all there already :)
> >
> > Jer
>
>
> Hi All,
>
> I did some work for a university professor who is trying to tackle the
> problem that publishers of technical periodicals own the bibliography
> citations in articles. However individual researchers own the
> bibliographies of their own publications, so by aggregating the
> bibliographies of individual researchers one can build an alternate open
> source of bibliography data. Apply the same principle to del.icio.us,
> stumbleupon etc. data.
>
> Individuals who have data on these services can individually and
> voluntarily copy their data out of those systems and into any public
> aggregation of such data. Now that Yahoo has BBAuth, a single-login
> authentication service, one could build a single web page where such a
> volunteer could go and authorize the download of their del.icio.us data
> into their account(s) on any other web service(s).
>
> There is no need to crawl the web pages and get into the arms race of IP
> blocking etc. that will naturally come up.
>
> The bigger picture here is that we as individuals own our own data and
> we should not let it be captive on web applications, rather we should be
> able to aggregate it wherever we choose - and if we should choose to do
> so we should be able to very simply push a few buttons and have *our own
> data* transferred between web applications.
>
>
> Nitin Borwankar.
>
>
>
> >------------------------------------------------------------------------
> >
> >_______________________________________________
> >Search-l mailing list
> >Search-l at wikia.com
> >http://lists.wikia.com/mailman/listinfo/search-l
> >Change options or unsubscribe:
> http://lists.wikia.com/mailman/options/search-l
> >
>
>
> --
>
>
> Nitin Borwankar
>
> http://walruscarpenter.wordpress.com Of shoes and ships and sealing
> wax of cabbages and kings
> http://greener.com Find, Learn, Act .... Greener, the search engine for
> the planet
> http://tagschema.com Implementation of tag database applications
>
> nitin at borwankar.com
> 510-872-7066
>
>