From sethf at sethf.com Fri Jun 1 02:55:38 2007
From: sethf at sethf.com (Seth Finkelstein)
Date: Thu, 31 May 2007 22:55:38 -0400
Subject: [Search-l] _Telegraph_ - "Wikipedia aims to take on Google"
Message-ID: <20070601025538.GA32128@sethf.com>

[Disclaimer: I didn't write this. Yes, it's another one making the mistake of confusing Wikipedia vs. Wikia]

http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/05/31/nwiki31.xml

Wikipedia aims to take on Google

By Harry Wallop, Consumer Affairs Correspondent
Last Updated: 6:49am BST 31/05/2007

# Can Wikipedia really challenge Google?

Jimmy Wales, the founder of the phenomenally successful online encyclopaedia Wikipedia, plans to launch a search engine to rival Google by the end of this year.

He has already launched Wikia, a company he hopes to use as a launchpad for an "open, thoughtful" search engine - one which he says will be better quality and more thoughtful than its rivals.

Wikia offers internet users a collection of online communities they can edit. Everything from Star Wars to travel guides gets its own mini-website, hosted by Wikia.

The latest project is more ambitious and has led to Mr Wales being described as "Google's worst nightmare".

The new search engine will publish the design of its algorithm - the mathematical formula that determines what website comes top of a search. Experts say this could make it vulnerable to spammers.

--
Seth Finkelstein  Consulting Programmer  http://sethf.com/
Infothought blog - http://sethf.com/infothought/blog/
Interview: http://sethf.com/essays/major/greplaw-interview.php

From sethf at sethf.com Fri Jun 1 03:54:26 2007
From: sethf at sethf.com (Seth Finkelstein)
Date: Thu, 31 May 2007 23:54:26 -0400
Subject: [Search-l] _Mahalo_ and co-opting Wikia Search
Message-ID: <20070601035426.GA32273@sethf.com>

[The following is from Danny Sullivan's article on the Mahalo launch]

Mahalo Launches With Human-Crafted Search Results
http://searchengineland.com/070530-180000.php

What if Mahalo somehow beat all the odds and seriously threatened Google? Doesn't that potentially weaken Mahalo, which is depending on Google to do the hard work of crawling the web and providing relevant results for all those tail terms that Mahalo won't target?

Calacanis sees this as unlikely -- but that's also where Search Wikia -- a project backed by Wikipedia's Jimmy Wales -- might come in. Wales is focused more on building an open-source crawl of the web that anyone could use (see Q&A With Jimmy Wales On Search Wikia for more on this). For that reason, Calacanis doesn't necessarily see himself as "beating" Wales to the punch with a new human-powered service and in fact sees the two projects as perhaps complementary.

"If his open source results are good and unique and better than Google's, we'll use them. We're defaulting to Google for long tail [query results] because they are the best search out there. I hope that he comes up with something great, because if it is open source, we'd have a great solution," Calacanis said.

[end excerpt from Danny Sullivan's article on the Mahalo Launch]

I know, Jason Calacanis also said the same thing on this list, but I cite the above to note that search experts are also considering the issue now.

To me, the most significant of Calacanis' past stances is not the SEO-is-dead stuff. That's just fluff, the sort of thing A-listers can do because they have the privilege that whatever they happen to rant about gets *heard*. It's like a radio talk-show host. People can attach entirely too much import to the day's show fodder. Rather, unremarked here, the put-ads-on-Wikipedia position is what I believe is most relevant.

It's difficult to get hard data on these sorts of business propositions. It exists somewhere, but it's surrounded by a bodyguard of hypes. But it's a complicated matter as to how much one can build a product out of technology vs. marketing and social networks. For example, _Wikipedia_ is not much in terms of technology - it's almost all network effect. While Google gained a dominant position through better technology, that technology has been duplicated by Yahoo and Microsoft, but it's not doing them any good.

Frankly, I shouldn't try to predict where any attempt to co-opt Wikia Search would end up; I'm way out of my area of expertise here.

--
Seth Finkelstein  Consulting Programmer  http://sethf.com/
Infothought blog - http://sethf.com/infothought/blog/
Interview: http://sethf.com/essays/major/greplaw-interview.php

From jeremie at jabber.org Fri Jun 1 06:10:45 2007
From: jeremie at jabber.org (jer)
Date: Fri, 1 Jun 2007 01:10:45 -0500
Subject: [Search-l] more than just interoperability
Message-ID:

Peter Saint-Andre had a great recent blog post, which is very relevant to what I believe in for search:

http://www.saint-andre.com/blog/2007-05.html#2007-05-30T15:53

Be Open

Is interoperability enough? Principle Six of the Mozilla Manifesto reads as follows:

  The effectiveness of the Internet as a public resource depends upon interoperability (protocols, data formats, content), innovation and decentralized participation worldwide.

That's great as far as it goes, but interoperability is not enough. AbiWord is interoperable with MS Word, email is interoperable with SMS through suitable gateways, Ghostscript is interoperable with Adobe Acrobat Reader, OpenOffice is interoperable with PowerPoint, Ogg Vorbis is interoperable with MP3 through various audio converters, and Pidgin is interoperable with AOL Instant Messenger. But the underlying protocols, data formats, and content are closed, proprietary, probably patent-encumbered, and under the control of large corporations and industry consortiums like Microsoft, Adobe, and AOL. The result? Text, music, video, and communications that are less free than they deserve to be, and an Internet that is less open than it needs to be for the continued viability of our open society.

When we talk about protocols and data formats, we are talking about standards. Standards need to be open. Sure, MS Word and PDF and PowerPoint and MP3 and AIM or MSN are de-facto "standards", but they are closed. By contrast, HTML and email and OpenData and Atom and Ogg Vorbis and Jabber are truly open technologies and open standards.

The Mozilla Foundation can be a great force for good in the world by consistently adopting open standards in its projects, creating new Mozilla-based projects (or working with existing projects, such as Songbird and SamePlace) that use open standards, and working with groups like the Electronic Frontier Foundation, the W3C, the IETF, XIPH, and the XMPP Standards Foundation to develop and extend the range of open standards. The long-term health of the Internet is at stake.

From aerik at thesylvans.com Fri Jun 1 06:36:21 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Thu, 31 May 2007 23:36:21 -0700
Subject: [Search-l] more than just interoperability
In-Reply-To:
References:
Message-ID: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>

I've been thinking about something that is kind of tangential to this - one of the things we've discussed is getting a large amount of proactive human data - tags, or something like them. It would take a really large number of tags (or whatever) to be really useful. Hopefully something like millions of websites each tagged by hundreds of people with at least several tags. So a dataset of perhaps a billion records is easy to imagine.

But, it's not easy to accumulate or process. Processing it is a technical hurdle which will be fun to tackle, but accumulating the data is a whole other matter.

So, here it is: Getting data from existing social bookmarking services may be an option we should consider. Think of it - aggregating data from del.icio.us, stumbleupon, etc. Now, I can't imagine how we'd get Yahoo to give us the data from del.icio.us, but maybe there are other providers who would be willing to do this. Or perhaps we look at paying them for it, at least enough to cover their bandwidth and other overhead.

Anybody got any ideas around this type of thing?

Aerik

From vprajan at gmail.com Fri Jun 1 08:47:23 2007
From: vprajan at gmail.com (Pushparajan V)
Date: Fri, 1 Jun 2007 14:17:23 +0530
Subject: [Search-l] more than just interoperability
In-Reply-To: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
Message-ID:

On 6/1/07, Aerik Sylvan wrote:
>
> I've been thinking about something that is kind of tangential to this -
> one of the things we've discussed is getting a large amount of proactive
> human data - tags, or something like them. It would take a really large
> number of tags (or whatever) to be really useful. Hopefully something like
> millions of websites each tagged by hundreds of people with at least several
> tags. So a dataset of perhaps a billion records is easy to imagine.
>
> But, it's not easy to accumulate or process. Processing it is a technical
> hurdle which will be fun to tackle, but accumulating the data is a whole
> other matter.
>
> So, here it is: Getting data from existing social bookmarking services
> may be an option we should consider. Think of it - aggregating data from
> del.icio.us, stumbleupon, etc. Now, I can't imagine how we'd get Yahoo to
> give us the data from del.icio.us, but maybe there are other providers who
> would be willing to do this. Or perhaps we look at paying them for it, at
> least enough to cover their bandwidth and other overhead.
>
> Anybody got any ideas around this type of thing?

Yeah.. it's a good way to find the actual interests of people through social bookmarking, digg and many other social websites.. But it all depends on whether they are ready to release data openly to such open source search projects.. It's always possible to get that data for a small sum or even for free if we have a larger user base and many people get involved in the project. But as you mentioned, del.icio.us, stumbleupon..
they must be very open to support this project and must think about making money with an open source business model instead of staying private and making deals with the big players.. That's the difficult part, I guess...

In my view, Web 2.0 is not that open, and users are getting caged in by some closed social networking sites..

> Aerik
>
> _______________________________________________
> Search-l mailing list
> Search-l at wikia.com
> http://lists.wikia.com/mailman/listinfo/search-l
> Change options or unsubscribe:
> http://lists.wikia.com/mailman/options/search-l
>

--
Pushparajan V
http://www.vprajan.org
- - - - - - - -
Know me: http://www.hackerkey.com/decrypt.php?hackerkey=v4sw57BCHJUY$hw3/5ln2pr6AFOPSck3ma4u7FLMSw7DTWXm6l6FGIKLRSU$i862NLJ0CAe6$t3b4en4a23Ns3MSr9g5AGO
- - - - - - - -

From jeremie at jabber.org Fri Jun 1 15:49:12 2007
From: jeremie at jabber.org (jer)
Date: Fri, 1 Jun 2007 10:49:12 -0500
Subject: [Search-l] more than just interoperability
In-Reply-To:
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
Message-ID: <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>

> So, here it is: Getting data from existing social bookmarking
> services may be an option we should consider. Think of it -
> aggregating data from del.icio.us, stumbleupon, etc. Now, I can't
> imagine how we'd get Yahoo to give us the data from del.icio.us,
> but maybe there are other providers who would be willing to do
> this. Or perhaps we look at paying them for it, at least enough to
> cover their bandwidth and other overhead.
>
> Anybody got any ideas around this type of thing?
>
> Yeah.. it's a good way to find the actual interests of people
> through social bookmarking, digg and many other social websites..
> But it all depends on whether they are ready to release data openly
> to such open source search projects..

For the most part, all of those sites and all of that data *is* open; it just needs to be intelligently crawled and indexed. They're great seed sites for keeping a crawler fresh.

Sure it would be nice to have it in a more digestible form, but it's all there already :)

Jer

From aerik at thesylvans.com Fri Jun 1 15:59:06 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Fri, 1 Jun 2007 08:59:06 -0700
Subject: [Search-l] more than just interoperability
In-Reply-To: <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
Message-ID: <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>

On 6/1/07, jer wrote:
> > So, here it is: Getting data from existing social bookmarking services
> > may be an option we should consider. Think of it - aggregating data from
> > del.icio.us, stumbleupon, etc. Now, I can't imagine how we'd get Yahoo
> > to give us the data from del.icio.us, but maybe there are other
> > providers who would be willing to do this. Or perhaps we look at paying
> > them for it, at least enough to cover their bandwidth and other overhead.
> >
> > Anybody got any ideas around this type of thing?
> >
> Yeah.. it's a good way to find the actual interests of people through
> social bookmarking, digg and many other social websites.. But it all
> depends on whether they are ready to release data openly to such open
> source search projects..
>
>
> For the most part, all of those sites and all of that data *is* open; it
> just needs to be intelligently crawled and indexed. They're great seed
> sites for keeping a crawler fresh.
>
> Sure it would be nice to have it in a more digestible form, but it's all
> there already :)
>

I guess that's kind of what I was talking about - if you're stubborn and clever enough you can crawl (scrape) just about anything - but getting some data in a reasonably digestible form, with permission, would be huge...

Aerik

From peter.burden at gmail.com Fri Jun 1 23:01:41 2007
From: peter.burden at gmail.com (peter burden)
Date: Sat, 02 Jun 2007 00:01:41 +0100
Subject: [Search-l] more than just interoperability
In-Reply-To: <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
Message-ID: <4660A555.7090401@gmail.com>

Aerik Sylvan wrote:
>
> On 6/1/07, *jer* wrote:
>
>> So, here it is: Getting data from existing social
>> bookmarking services may be an option we should consider.
>> Think of it - aggregating data from del.icio.us,
>> stumbleupon, etc. Now, I can't imagine how we'd get Yahoo
>> to give us the data from del.icio.us, but maybe there are
>> other providers who would be willing to do this. Or perhaps
>> we look at paying them for it, at least enough to cover
>> their bandwidth and other overhead.
>>
>> Anybody got any ideas around this type of thing?
>>
>>
>> Yeah.. it's a good way to find the actual interests of people
>> through social bookmarking, digg and many other social websites..
>> But it all depends on whether they are ready to release data
>> openly to such open source search projects..
>
> For the most part, all of those sites and all of that data *is*
> open; it just needs to be intelligently crawled and indexed.
> They're great seed sites for keeping a crawler fresh.
>
> Sure it would be nice to have it in a more digestible form, but
> it's all there already :)
>
>
> I guess that's kind of what I was talking about - if you're stubborn
> and clever enough you can crawl (scrape) just about anything - but
> getting some data in a reasonably digestible form, with permission,
> would be huge...

I'd have concerns about the quality of the information. Once it became clear that this sort of "social" information was affecting rankings, which are important to commercial web sites, then a small business would find it very tempting to give good, positive/relevant ratings to their own pages and negative/irrelevant ratings to those of competitors. The scale of the WWW is such that I cannot conceive of any community effort that would be able to police and resolve such actions.

However, the sites mentioned would be excellent sources of crawler seeds, although effective crawling of dynamic (Web 2.0) database/CMS driven sites poses some significant problems - especially if they're using Ajax.

>
> Aerik

From contact at tallstreet.com Sat Jun 2 20:08:58 2007
From: contact at tallstreet.com (Tall Street)
Date: Sun, 3 Jun 2007 08:08:58 +1200
Subject: [Search-l] more than just interoperability
In-Reply-To: <4660A555.7090401@gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
 <4660A555.7090401@gmail.com>
Message-ID:

That is a problem, and this is happening a lot under Google's system at the moment.
People know Google ranks sites based on link structure, so people are purchasing links or hiring SEO firms to boost their own ranking in Google's search engine.

It's inevitable that people will promote themselves. The best way to deal with it is to provide a fair system where everyone has an equal chance, with a feedback loop that takes input from the users to rerank and improve the quality. This is essentially what http://www.tallstreet.com/ does.

Gary

> I'd have concerns about the quality of the information. Once it became
> clear that this sort of "social" information
> was affecting rankings, which are important to commercial web sites,
> then a small business would find it very
> tempting to give good, positive/relevant ratings to their own pages and
> negative/irrelevant ratings to those of competitors.
> The scale of the WWW is such that I cannot conceive of any community
> effort that would be able to police and
> resolve such actions.
>
> However, the sites mentioned would be excellent sources of crawler seeds,
> although effective crawling of
> dynamic (Web 2.0) database/CMS driven sites poses some significant
> problems - especially if they're using
> Ajax.
> >
> > Aerik
>

From jason at calacanis.com Sat Jun 2 21:06:48 2007
From: jason at calacanis.com (Jason Calacanis)
Date: Sat, 2 Jun 2007 14:06:48 -0700
Subject: [Search-l] more than just interoperability
In-Reply-To: <4660A555.7090401@gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
 <4660A555.7090401@gmail.com>
Message-ID: <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com>

On 6/1/07, peter burden wrote:
> However, the sites mentioned would be excellent sources of crawler seeds,
> although effective crawling of
> dynamic (Web 2.0) database/CMS driven sites poses some significant
> problems - especially if they're using
> Ajax.

Peter,

The even bigger problem is that the folks who have the best information to crawl--folks like delicious and google--do not allow metasearch and would take action if you sucked their data into another dataset that competed with them.

They would, of course, have a very good point: it's not very fair to build a business overnight by indexing their information. Folks have been trying to do this to Craigslist and Craig Newmark has blocked them.

The DMOZ is open, but their data isn't clean enough at this point to be of any real value. That's no fault of the DMOZ editors; it's due to--from what they tell me--AOL ignoring them/not supporting them.
For the record, I tried to buy DMOZ from AOL when I was leaving and they had no interest in selling it.

best regards,

Jason
-----------
http://www.mahalo.com
http://www.calacanis.com

From aerik at thesylvans.com Sat Jun 2 22:13:37 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Sat, 2 Jun 2007 15:13:37 -0700
Subject: [Search-l] Fwd: more than just interoperability
In-Reply-To: <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
 <4660A555.7090401@gmail.com>
 <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com>
 <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com>
Message-ID: <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com>

On 6/2/07, Jason Calacanis wrote:
>
>
> The even bigger problem is that the folks who have the best
> information to crawl--folks like delicious and google--do not allow
> metasearch and would take action if you sucked their data into another
> dataset that competed with them.
>
> They would, of course, have a very good point: it's not very fair to
> build a business overnight by indexing their information. Folks have
> been trying to do this to Craigslist and Craig Newmark has blocked
> them.

Exactly the point I'm making. There's a tremendous amount of pretty good human generated data out there (I would argue that, at a minimum, the categorization of urls in dmoz is of some value, anyway), and integrating it would be a great way to get a decent signal to noise ratio. I think the wikia search has the capability to create some novel and interesting filtering and weighting mechanisms, but all by our lonesome, it will be difficult and time-consuming to develop a really interesting dataset.

So, since Wikia is a for-profit venture, perhaps it makes sense to look into licensing some data from closed providers (stumbleupon, for instance - since del.icio.us/yahoo is unlikely to want to feed a competing search engine).

Aerik

From jason at calacanis.com Sat Jun 2 22:37:12 2007
From: jason at calacanis.com (Jason Calacanis)
Date: Sat, 2 Jun 2007 15:37:12 -0700
Subject: [Search-l] Fwd: more than just interoperability
In-Reply-To: <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
 <4660A555.7090401@gmail.com>
 <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com>
 <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com>
 <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com>
Message-ID: <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com>

> So, since Wikia is a for-profit venture, perhaps it makes sense to look into
> licensing some data from closed providers (stumbleupon, for instance - since
> del.icio.us/yahoo is unlikely to want to feed a competing search engine).
> Aerik

Chances of getting EBAY or Yahoo to give up StumbleUpon or Delicious data are very, very low--like NFW low. They paid millions for those services and see them as major competitive advantages for their multi-billion dollar businesses--they won't give that kind of ammo to someone as "dangerous" as Jimmy Wales.

best j
---------------------
Jason McCabe Calacanis
www.mahalo.com

From aerik at thesylvans.com Sat Jun 2 23:27:37 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Sat, 2 Jun 2007 16:27:37 -0700
Subject: [Search-l] Fwd: more than just interoperability
In-Reply-To: <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
 <4660A555.7090401@gmail.com>
 <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com>
 <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com>
 <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com>
 <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com>
Message-ID: <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com>

Hmm... I'd missed that stumbleupon was owned by ebay. I guess I missed the social bookmarking boom. Maybe I still have time to ride the community driven directory/search boom.

Well... I wonder if there are second tier partners that would make sense, then. And I wonder - Ebay is not in the search market - perhaps, for money, they would license their data.

Here's another thought: Try to get into more markets (and thus aggregate more data) by presenting several faces of wikia-search. A social bookmarking service for instance.

But I still think that - infeasible as it may be - the greatest near-term chance to get results more relevant than what Google returns is to aggregate lots more human-generated data. Some Wikia can get directly from users, but only some.

So, Jason, I was really looking forward to hearing your thoughts on my other point - constantly pulling in fresh data and resisting the "entrenched results" paradigm...?

Best Regards,
Aerik

On 6/2/07, Jason Calacanis wrote:
>
> > So, since Wikia is a for-profit venture, perhaps it makes sense to look
> > into licensing some data from closed providers (stumbleupon, for
> > instance - since del.icio.us/yahoo is unlikely to want to feed a
> > competing search engine).
> > Aerik
>
> Chances of getting EBAY or Yahoo to give up StumbleUpon or Delicious
> data are very, very low--like NFW low. They paid millions for those
> services and see them as major competitive advantages for their
> multi-billion dollar businesses--they won't give that kind of ammo
> to someone as "dangerous" as Jimmy Wales.
>
> best j
> ---------------------
> Jason McCabe Calacanis
> www.mahalo.com
>

From aerik at thesylvans.com Sat Jun 2 23:36:32 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Sat, 2 Jun 2007 16:36:32 -0700
Subject: [Search-l] Fwd: more than just interoperability
In-Reply-To: <714251711-1180827010-cardhu_decombobulator_blackberry.rim.net-176205837-@bxe034.bisx.prod.on.blackberry>
References: <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
 <4660A555.7090401@gmail.com>
 <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com>
 <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com>
 <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com>
 <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com>
 <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com>
 <714251711-1180827010-cardhu_decombobulator_blackberry.rim.net-176205837-@bxe034.bisx.prod.on.blackberry>
Message-ID: <355a36af0706021636q7a27efecn45aee4a9e39038a7@mail.gmail.com>

On 6/2/07, Jason McCabe Calacanis wrote:
>
> Well, the approach at mahalo is to use humans--paid ones with health care
> and all--to hand write the first 10,000.
> Check it out at mahalo.com.
>

Well, yes, I got that much. Are you going to constantly revisit those 10,000, and look for "better" results? That'd be fine, you know, but I'd imagine it would come at a high overhead.

Don't get me wrong - I'm a big fan of benefits and health care, and personally I have a few doubts about how well volunteer developers working for a for-profit entity is going to work out - I'm just trying to ask questions that get behind the marketing into some of the actual meat of your search engine. On the surface, it sounds a lot like dmoz.

Best Regards,
Aerik

From jason at calacanis.com Sat Jun 2 23:29:37 2007
From: jason at calacanis.com (Jason McCabe Calacanis)
Date: Sat, 2 Jun 2007 23:29:37 +0000
Subject: [Search-l] Fwd: more than just interoperability
In-Reply-To: <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
 <4660A555.7090401@gmail.com>
 <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com>
 <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com>
 <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com>
 <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com>
 <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com>
Message-ID: <714251711-1180827010-cardhu_decombobulator_blackberry.rim.net-176205837-@bxe034.bisx.prod.on.blackberry>

Well, the approach at mahalo is to use humans--paid ones with health care and all--to hand write the first 10,000. Check it out at mahalo.com.

Figuring out what comes after 10,000--social, machines, combo--is up for grabs I think.

J
---------------
Jason at Calacanis.com | 310-456-4900
www.calacanis.com

-----Original Message-----
From: "Aerik Sylvan"
Date: Sat, 2 Jun 2007 16:27:37
To: search-l at wikia.com, jason at calacanis.com
Subject: Re: [Search-l] Fwd: more than just interoperability

Hmm... I'd missed that stumbleupon was owned by ebay. I guess I missed the social bookmarking boom. Maybe I still have time to ride the community driven directory/search boom.

Well... I wonder if there are second tier partners that would make sense, then.  And I wonder - Ebay is not in the search market - perhaps, for money, they would license their data.

Here's another thought:  Try to get into more markets (and thus aggregate more data) by presenting several faces of wikia-search.  A social bookmarking service for instance.

But I still think that - infeasible as it may be - the greatest near-term chance to get results more relevant than what Google returns is to aggregate lots more human-generated data. Some Wikia can get directly from users, but only some.

So, Jason, I was really looking forward to hearing your thoughts on my other point - constantly pulling in fresh data and resisting the "entrenched results" paradigm...?

Best Regards,
Aerik

On 6/2/07, Jason Calacanis <jason at calacanis.com> wrote:
> So, since Wikia is a for-profit venture, perhaps it makes sense to look into
> licensing some data from closed providers (stumbleupon, for instance - since
> del.icio.us/yahoo is unlikely to want to feed a competing search engine).
> Aerik

Chances of getting EBAY or Yahoo to give up StumbleUpon or Delicious
data are very, very low--like NFW low. They paid millions for those
services and see them as major competitive advantages for their
multi-billion dollar businesses--they won't give that kind of ammo
to someone as "dangerous" as Jimmy Wales.

best j
---------------------
Jason McCabe Calacanis
www.mahalo.com

From jason at calacanis.com Sat Jun 2 23:39:13 2007
From: jason at calacanis.com (Jason McCabe Calacanis)
Date: Sat, 2 Jun 2007 23:39:13 +0000
Subject: [Search-l] Fwd: more than just interoperability
In-Reply-To: <355a36af0706021636q7a27efecn45aee4a9e39038a7@mail.gmail.com>
References: <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org> <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com> <4660A555.7090401@gmail.com> <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com> <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com> <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com> <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com> <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com> <714251711-1180827010-cardhu_decombobulator_blackberry.rim.net-176205837-@bxe034.bisx.prod.on.blackberry> <355a36af0706021636q7a27efecn45aee4a9e39038a7@mail.gmail.com>
Message-ID: <1609692912-1180827584-cardhu_decombobulator_blackberry.rim.net-1671855404-@bxe034.bisx.prod.on.blackberry>

I believe ceos and management teams of tech firms aren't the only ones who should get paid. :)

Editors do real work and should--if they choose--be paid for it. Working for free as a hobby is fine (ie wikipedia), but why anyone would work for free to make venture capitalists and ceos right is beyond me.

Re the top 10k, we can keep them updated and make a living I think. :)

J
---------------
Jason at Calacanis.com | 310-456-4900
www.calacanis.com

-----Original Message-----
From: "Aerik Sylvan"
Date: Sat, 2 Jun 2007 16:36:32
To:jason at calacanis.com
Cc:search-l at wikia.com
Subject: Re: [Search-l] Fwd: more than just interoperability

On 6/2/07, Jason McCabe Calacanis <jason at calacanis.com> wrote:
> Well, the approach at mahalo is to use humans--paid ones with health care and all--to hand write the first 10,000. Check it out at mahalo.com.


Well, yes, I got that much.  Are you going to constantly revisit those 10,000, and look for "better" results?  That'd be fine, you know, but I'd imagine it would come at a high overhead.  Don't get me wrong - I'm a big fan of benefits and health care, and personally I have a few doubts about how well volunteer developers working for a for-profit entity are going to work out - I'm just trying to ask questions that get behind the marketing into some of the actual meat of your search engine.  On the surface, it sounds a lot like dmoz.

Best Regards,
Aerik
From sethf at sethf.com Sun Jun 3 12:43:37 2007 From: sethf at sethf.com (Seth Finkelstein) Date: Sun, 3 Jun 2007 08:43:37 -0400 Subject: [Search-l] "directory" vs. "search engine" In-Reply-To: <1609692912-1180827584-cardhu_decombobulator_blackberry.rim.net-1671855404-@bxe034.bisx.prod.on.blackberry> References: <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org> <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com> <4660A555.7090401@gmail.com> <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com> <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com> <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com> <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com> <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com> <1609692912-1180827584-cardhu_decombobulator_blackberry.rim.net-1671855404-@bxe034.bisx.prod.on.blackberry> Message-ID: <20070603124337.GA11441@sethf.com> > From: "Aerik Sylvan" > On the surface, it sounds a lot like dmoz. Ah, but I think the key difference in Mahalo vs DMOZ is in a halfway reasonable monetization strategy (cream-skim the top search terms). This is not *comprehensive* coverage of the Web, which I think is tripping people up when it comes to discussing directory vs. general search engine. Rather, it's *focus*, on the specific segment likely to be most profitable. Yes, it's really more of a "directory", not a "search engine". But the other side of that is it's a directory which is optimized to work both "in" and "out" with a search engine, and an eye towards profitability. Which is something of a twist on the usual directory concept (which usually starts from a taxonomy and concerns itself with breadth). On Sat, Jun 02, 2007 at 11:39:13PM +0000, Jason McCabe Calacanis wrote: > I believe ceos and management teams of tech firms aren't the only > ones who should get paid. :) > > Editors do real work and should--if the choose--be paid for > it. 
Working for free as a hobby is fine (ie wikipedia), but why > anyone would work for free to make venture capitalists and ceos > right is beyond me. For the joy and happiness, the *community*, of course. I think part of what Y. Benkler is analyzing in his infamous book, though not put so bluntly, is this: If you have 100K to hire workers, you can put 10K each towards 10 people (and after benefits and overhead, pay them around 5K each total), Or, put 100K towards a really good marketing flack who will go around trying to convince 10 people in the entire world to WORK FOR FREE, because gosh golly they're contributing to A New Era, and showing those elitist priests up there that citizen-amateurs can do a job without pay that's every bit as good as paid professionals. http://ireadnews.net/2007/06/01/britannica-rules-not-necessarily-with-the-advent-of-wikipedia/ "Theresa is one of a dedicated band of unpaid volunteers who act as the guardians of the Wikipedia online encyclopaedia - one of the great internet successes of the decade. The passion for sharing knowledge that is apparent in her teaching - she loves her Year 1s and 2s because, she says, they are full of wonder and enthusiasm for learning - translates after hours into tireless work for Wikipedia. So firmly does she believe in its mission - to accumulate every piece of knowledge in the world in one easily accessible place - that she spends hours in the evening, or in the early morning before leaving for work, checking the site, editing pages and helping contributors." But I get in trouble when I talk about that. At least in the wrong place. Anyway, I may be projecting, but I think the membership of this list skews more towards those on the code-developer end, and interested parties keeping an eye on the project, rather than those who will be doing grunt work of writing specific results. 
-- Seth Finkelstein Consulting Programmer http://sethf.com/ Infothought blog - http://sethf.com/infothought/blog/ Interview: http://sethf.com/essays/major/greplaw-interview.php From wsurowiec at gmail.com Sun Jun 3 18:34:50 2007 From: wsurowiec at gmail.com (William Surowiec) Date: Sun, 03 Jun 2007 14:34:50 -0400 Subject: [Search-l] [relevancy of search results] Message-ID: <466309CA.70809@gmail.com> An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070603/fd242c88/attachment.html From vprajan at gmail.com Sun Jun 3 19:42:11 2007 From: vprajan at gmail.com (Pushparajan V) Date: Mon, 4 Jun 2007 01:12:11 +0530 Subject: [Search-l] more than just interoperability In-Reply-To: <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org> References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com> <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org> Message-ID: Terms for del.icio.us: http://del.icio.us/help/terms Terms of stumbleupon: http://www.stumbleupon.com/terms.html If our project starts using them, we are sued for sure.. :) On 6/1/07, jer wrote: > > So, here it is: Getting data from existent social bookmarking services > > may be an option we should consider. Think of it - aggregating data from > > del.icio.us, stumbleupon, etc. Now, I can't imagine how we'd get Yahoo > > to give us the data from del.icio.us, but maybe there are other > > providers who would be willing to do this. Or perhaps we look at paying > > them for it, at least enough to cover their bandwidth and other overhead. > > > > Anybody got an ideas around this type of thing? > > > Yeah.. its a good way to find the actual interest of the people thru > social book marking, digg and many other social websites.. But it all > matters whether they are ready to release data open to such open source > search projects.. > > > For the most part, all of those sites and all of that data *is* open, it > just needs to be intelligently crawled and indexed. 
They're great seed > sites for keeping a crawler fresh. > > Sure it would be nice to have it in a more digestible form, but it's all > there already :) > > Jer > -- Pushparajan V http://www.vprajan.org - - - - - - - - Know me: http://www.hackerkey.com/decrypt.php?hackerkey=v4sw57BCHJUY$hw3/5ln2pr6AFOPSck3ma4u7FLMSw7DTWXm6l6FGIKLRSU$i862NLJ0CAe6$t3b4en4a23Ns3MSr9g5AGO - - - - - - - - -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070604/4f363c51/attachment.html From nitin at borwankar.com Sun Jun 3 19:59:07 2007 From: nitin at borwankar.com (Nitin Borwankar) Date: Sun, 03 Jun 2007 12:59:07 -0700 Subject: [Search-l] more than just interoperability In-Reply-To: <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org> References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com> <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org> Message-ID: <46631D8B.1010800@borwankar.com> jer wrote: >> So, here it is: Getting data from existent social bookmarking >> services may be an option we should consider. Think of it - >> aggregating data from del.icio.us , >> stumbleupon, etc. Now, I can't imagine how we'd get Yahoo to >> give us the data from del.icio.us , but maybe >> there are other providers who would be willing to do this. Or >> perhaps we look at paying them for it, at least enough to cover >> their bandwidth and other overhead. >> >> Anybody got an ideas around this type of thing? >> >> >> Yeah.. its a good way to find the actual interest of the people thru >> social book marking, digg and many other social websites.. But it all >> matters whether they are ready to release data open to such open >> source search projects.. > > > For the most part, all of those sites and all of that data *is* open, > it just needs to be intelligently crawled and indexed. They're great > seed sites for keeping a crawler fresh. 
>
> Sure it would be nice to have it in a more digestible form, but it's
> all there already :)
>
> Jer

Hi All,

I did some work for a university professor who is trying to tackle the problem that publishers of technical periodicals own the bibliography citations in articles. However, individual researchers own the bibliographies of their own publications, so by aggregating the bibliographies of individual researchers one can build an alternate open source of bibliography data. Apply the same principle to del.icio.us, stumbleupon etc. data.

Individuals who have data on these services can individually and voluntarily copy their data out of those systems and into any public aggregation of such data. Now that Yahoo has BBAuth - a single-login authentication service - one could build a single web page where such a volunteer individual could go and authorize the download of their del.icio.us data into their account(s) on any other web service(s).

There is no need to crawl the web pages and get into the arms race of IP blocking etc. that will naturally come up.

The bigger picture here is that we as individuals own our own data and we should not let it be captive on web applications; rather, we should be able to aggregate it wherever we choose - and if we should choose to do so we should be able to very simply push a few buttons and have *our own data* transferred between web applications.

Nitin Borwankar.

>------------------------------------------------------------------------
>
>_______________________________________________
>Search-l mailing list
>Search-l at wikia.com
>http://lists.wikia.com/mailman/listinfo/search-l
>Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l
>

--
Nitin Borwankar
http://walruscarpenter.wordpress.com Of shoes and ships and sealing wax of cabbages and kings
http://greener.com Find, Learn, Act ....
Greener, the search engine for the planet http://tagschema.com Implementation of tag database applications nitin at borwankar.com 510-872-7066 From jason at calacanis.com Sun Jun 3 21:17:19 2007 From: jason at calacanis.com (Jason Calacanis) Date: Sun, 3 Jun 2007 14:17:19 -0700 Subject: [Search-l] "directory" vs. "search engine" In-Reply-To: <20070603124337.GA11441@sethf.com> References: <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com> <4660A555.7090401@gmail.com> <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com> <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com> <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com> <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com> <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com> <1609692912-1180827584-cardhu_decombobulator_blackberry.rim.net-1671855404-@bxe034.bisx.prod.on.blackberry> <20070603124337.GA11441@sethf.com> Message-ID: <70b3cf150706031417j26ac3eefn516c114267f7dae4@mail.gmail.com> On 6/3/07, Seth Finkelstein wrote: > Ah, but I think the key difference in Mahalo vs DMOZ is > in a halfway reasonable monetization strategy (cream-skim the > top search terms). This is not *comprehensive* coverage of the Web, > which I think is tripping people up when it comes to discussing > directory vs. general search engine. Rather, it's *focus*, on the > specific segment likely to be most profitable. Yes, it's really > more of a "directory", not a "search engine". But the other side of > that is it's a directory which is optimized to work both "in" and "out" > with a search engine, and an eye towards profitability. Which is something > of a twist on the usual directory concept (which usually starts from a > taxonomy and concerns itself with breadth). That is a very astute point Seth. We are doing the top 10,000 english search terms, and we are using a search format/metaphor/design. 
We do have categories, and you can navigate in a DMOZ-like way, so I like to call Mahalo.com a "search service."

> > it. Working for free as a hobby is fine (ie wikipedia), but why
> > anyone would work for free to make venture capitalists and ceos
> > right is beyond me.
> For the joy and happiness, the *community*, of course. I think
> part of what Y. Benkler is analyzing in his infamous book, though not put
> so bluntly, is this: If you have 100K to hire workers, you can put 10K
> each towards 10 people (and after benefits and overhead, pay them around
> 5K each total), Or, put 100K towards a really good marketing flack
> who will go around trying to convince 10 people in the entire world to
> WORK FOR FREE, because gosh golly they're contributing to A New Era,
> and showing those elitist priests up there that citizen-amateurs can
> do a job without pay that's every bit as good as paid professionals.

I've debated Benkler on this point, and at Wikimania in Boston this summer one of the speakers (before or after Benkler) made a point of demonstrating that folks involved in open source software projects participated because they thought their participation would get them some sort of financial return down the road (or something to that effect--anyone remember?).

Anyway, at the end of the day everyone needs to eat, and when your hobby moves from being part time to full-time the rubber meets the road. In fact, look at Jimmy and Angela, who did amazing free work on the Wikipedia for years and years and are now working on a for-profit company--Wikia--backed by the most aggressive form of capital in the world: venture capital. If the top folks from Wikipedia have left to "swing for the fences" at a venture-backed company, that tells you something about Benkler's theories now, doesn't it?

> But I get in trouble when I talk about that. At least in the
> wrong place.
> Anyway, I may be projecting, but I think the membership
> of this list skews more towards those on the code-developer end, and
> interested parties keeping an eye on the project, rather than those
> who will be doing grunt work of writing specific results.

Exactly. That's the part I hate. Developers can get paid, administrators can get paid, but editors don't!? What if Theresa in your example got paid for doing good? Would that be such a bad thing? That's what we're trying to do at Mahalo: let everyone "get a taste" not just the management teams of venture-backed companies.

[ Note: I'm not saying Wikipedia should move to a paid model, but I think if Wikipedia allowed OPT-IN advertising they could hire 10-25 editors f/t to work from home and administer the system. It would be a start. ]

best regards and Mahalo ;-) for the amazing feedback Seth... you're like having a free management consultant on 24-hour duty!!!

Jason
www.mahalo.com

From aerik at thesylvans.com Mon Jun 4 04:13:20 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Sun, 3 Jun 2007 21:13:20 -0700
Subject: [Search-l] "directory" vs. "search engine"
In-Reply-To: <70b3cf150706031417j26ac3eefn516c114267f7dae4@mail.gmail.com>
References: <4660A555.7090401@gmail.com> <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com> <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com> <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com> <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com> <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com> <1609692912-1180827584-cardhu_decombobulator_blackberry.rim.net-1671855404-@bxe034.bisx.prod.on.blackberry> <20070603124337.GA11441@sethf.com> <70b3cf150706031417j26ac3eefn516c114267f7dae4@mail.gmail.com>
Message-ID: <355a36af0706032113wd41df06n64b92dd7d2955410@mail.gmail.com>

On 6/3/07, Jason Calacanis wrote:
>
> I've debated Benkler on this point, and at Wikimania in Boston this
> summer one of the speakers (before or after Benkler) made a point of
> demonstrating that folks involved in open source software projects
> participated because they thought their participation would get them
> some sort of financial return down the road (or something to that
> effect--anyone remember?).

Well, I think you get a mix of folks. I, for one, participate for several reasons: the possibility of future financial gain, the possibility of some degree of fame/glory, and there's also an element of "help change the world".

> [ Note: I'm not saying Wikipedia should move to a paid model, but I
> think if Wikipedia allowed OPT-IN advertising they could hire 10-25
> editors f/t to work from home and administer the system. It would be a
> start. ]
>
>

Hmm... I think that would blow up the model. I doubt the editors are in it for financial gain - what would be the path? It's a little easier to imagine financial gain for developers. I don't think paying editors is a bad idea, but I think if you pay some of the editors, it will be a disincentive for a lot of others who would otherwise volunteer. "I don't need to worry about this - they've got John Doe on the payroll, he'll take care of it."

Aerik
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070603/6755576e/attachment.html

From vprajan at gmail.com Mon Jun 4 12:40:24 2007
From: vprajan at gmail.com (Pushparajan V)
Date: Mon, 4 Jun 2007 18:10:24 +0530
Subject: [Search-l] User powered search engine
Message-ID:

I found this search site interesting. It has a good interface for the user.

http://sproose.com/

--
Pushparajan V
http://www.vprajan.org
- - - - - - - -
Know me: http://www.hackerkey.com/decrypt.php?hackerkey=v4sw57BCHJUY$hw3/5ln2pr6AFOPSck3ma4u7FLMSw7DTWXm6l6FGIKLRSU$i862NLJ0CAe6$t3b4en4a23Ns3MSr9g5AGO
- - - - - - - -
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070604/11f2054e/attachment.html

From fred.benenson at gmail.com Mon Jun 4 14:59:11 2007
From: fred.benenson at gmail.com (Fred Benenson)
Date: Mon, 4 Jun 2007 10:59:11 -0400
Subject: [Search-l] more than just interoperability
In-Reply-To: <46631D8B.1010800@borwankar.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com> <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org> <46631D8B.1010800@borwankar.com>
Message-ID: <8e447b720706040759n31ce078fle6244b39158fa73@mail.gmail.com>

Instead of publicly crawling the human indexes (del.icio.us / stumbleupon, etc.) ourselves, why don't we have our users do it? It's not a complete workaround, but might be an approach that works for a bit.

Here's how I envision it:

* A client side crawler (similar to yacy, but not targeting the entire web, just metadata rich places) implemented through a Firefox extension or Greasemonkey script.
* When a client visits a social network with valuable data (as determined by a list managed by us) their local client makes a copy of all the data delivered to their client side browser.
* The server can't tell the difference between a user surfing with the extension or without the extension.
* That data is then meta-tagged and packaged properly locally and sent to the Wikia Search servers from the client's machine.
* The Wikia Search servers then index and make sense of all of this data culled from the various clients running the Wikia Search client / extension.

This way we're able to work around the bandwidth concerns that Yahoo and company would have with us crawling their databases. And the data that we're getting is merely stuff that is being browsed naturally, by live humans, so it's likely of more value.

But bandwidth is obviously not just what they're concerned with. As has been mentioned, these sites view these databases of useful human tagged information as enormously valuable assets that give them a competitive edge. So then it's a question of the "intellectual property" contained in those databases.

Now, I'm not sure if other networks do this, but Del.icio.us and Flickr have Creative Commons license implementation. That means that a particular user's stream of content that they've created (links, photos, etc.) can be set for people to share it. I think this would be a perfect opportunity for our distributed crawlers to take advantage of.

Thoughts?

Fred

On 6/3/07, Nitin Borwankar wrote:
>
> jer wrote:
>
> >> So, here it is: Getting data from existent social bookmarking
> >> services may be an option we should consider. Think of it -
> >> aggregating data from del.icio.us,
> >> stumbleupon, etc. Now, I can't imagine how we'd get Yahoo to
> >> give us the data from del.icio.us, but maybe
> >> there are other providers who would be willing to do this.
Or > >> perhaps we look at paying them for it, at least enough to cover > >> their bandwidth and other overhead. > >> > >> Anybody got an ideas around this type of thing? > >> > >> > >> Yeah.. its a good way to find the actual interest of the people thru > >> social book marking, digg and many other social websites.. But it all > >> matters whether they are ready to release data open to such open > >> source search projects.. > > > > > > For the most part, all of those sites and all of that data *is* open, > > it just needs to be intelligently crawled and indexed. They're great > > seed sites for keeping a crawler fresh. > > > > Sure it would be nice to have it in a more digestible form, but it's > > all there already :) > > > > Jer > > > Hi All, > > I did some work for a university professor who is trying to tackle the > problem that publishers of technical periodicals own the bibliography > citations in articles. However individual researchers own the > bibliographies of their own publications, so by aggregating the > bibliographies of individual researchers one can build an alternate open > source of bibliography data. Apply the same principle to del.ici.ous, > stumbleupon etc data. > > Individuals who have data on these services can individually and > voluntarily copy their data out of those systems and into any public > aggregation > of such data. Now that Yahoo has BBAuth - a single-login authentication > service, one could build a single web page where such a volunteer > individual could go and authorize the download of their del.ici.ous data > into their account(s) on any other web service(s). > > There is no need to crawl the web pages and get into the arms race of IP > blocking etc. that will naturally come up. 
> > The bigger picture here is that we as individuals own our own data and > we should not let it be captive on web applications, rather we should be > able to aggregate it wherever we choose - and if we should choose to do > so we should be able to very simply push a few buttons and have *our own > data* transferred between web applications. > > > Nitin Borwankar. > > > > >------------------------------------------------------------------------ > > > >_______________________________________________ > >Search-l mailing list > >Search-l at wikia.com > >http://lists.wikia.com/mailman/listinfo/search-l > >Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > > > > > -- > > > Nitin Borwankar > > http://walruscarpenter.wordpress.com Of shoes and ships and sealing > wax of cabbages and kings > http://greener.com Find, Learn, Act .... Greener, the search engine for > the planet > http://tagschema.com Implementation of tag database applications > > nitin at borwankar.com > 510-872-7066 > > > _______________________________________________ > Search-l mailing list > Search-l at wikia.com > http://lists.wikia.com/mailman/listinfo/search-l > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070604/20945198/attachment.html From wsurowiec at gmail.com Tue Jun 5 20:56:35 2007 From: wsurowiec at gmail.com (William Surowiec) Date: Tue, 05 Jun 2007 16:56:35 -0400 Subject: [Search-l] question and a potential request Message-ID: <4665CE03.3030003@gmail.com> I sent an email to the group a few days ago. I thought nothing of not receiving an email copy. Today I went to respond to my own email and discovered the original had been sent in html and the content had been stripped out. The content is available as an added click online but might possibly be the reason for my not receiving a copy. 
(I just checked, my preferences had been set to receive a copy.) So the question is - is this the reason I, and presumably others, did not receive a copy? If yes, I will make a plain text copy and repost it. I will also try to pay more attention to details in the future. My request is to the folks that manage the email software - if the above supposition is correct - when and if an unacceptably formatted (or otherwise judged unacceptable) email is received then return it to the sender with a link to a page detailing acceptable formats and content. This is not a complaint but rather an attempt to be helpful. (And a test to see if I receive a copy of this _plain text_ email.) Bill From wsurowiec at gmail.com Wed Jun 6 18:59:33 2007 From: wsurowiec at gmail.com (William Surowiec) Date: Wed, 06 Jun 2007 14:59:33 -0400 Subject: [Search-l] [relevancy of search results] Message-ID: <46670415.5090709@gmail.com> (This is a plain text reposting of an earlier, accidental html posting with an additional link at the end.) An interesting article (http://jcmc.indiana.edu/vol12/issue3/vancouvering.html#schemas) has begun to change my mind. I have been somewhat of a "lurker" waiting to gain access to crawl results to pass them through a "natural language processing" pipeline (see UIMA: http://incubator.apache.org/uima/.) I admit to not believing in the success of a voluntary group rating system (note, this is far from saying I believe in the opposite: that it will fail) I know that I do not, and cannot, know the outcome till we get there. The following quote from the article has forced me to question the potential efficacy of both my approach and "the" (quotes because it is only my impression of what I believe is still evolving) voluntary group rating system being discussed. *** quoted text follows *** What is relevance? In a small, well-defined database, it is relatively easy to sort relevant from irrelevant documents. On the Web, this is not necessarily as simple. 
One interviewee commented that the standard of relevance has changed from when he began to work with information retrieval systems:

[W]here the systems used to only be the Dialogues and the Lexis-Nexises, you know, I think they strove for a more academic standard of relevance, where you define relevance as the relationship between the subject that is in the document with what the user is asking about. So it is sort of topical relevance. Whereas in the practical world where the search engines are reaching today, something being useful to the user and something where the user grabs the information and continues, has become, I think, more important and less emphasis on say, getting the best document. (Interviewee G)

In other words, as this interviewee says elsewhere, it is about "satisfying users." Relevance has changed from some type of topical relevance based on an applied classification to something more subjective.

*** end quoted text ***

If this is so (and others may fairly argue against that point) then a determination of the user relevance of a link needs to be in alignment with the intentions of the user and is neither inherent in the document nor in _any_ meta data associated with the link that is not so aligned.

I believe this leads to requiring knowledge about the user that cannot be derived solely from the query - to impute the user's intent will require:

1. identification of the user (may be anonymous, but a specific anonymous user - a token in the user's possession)
2. the newly entered query from this user
3. the search history (the ordered collection of query and results returned and user action taken) of this user and many others
4. an ability to impute a current relevancy value for a link in a result set for a query given this user and the actions taken by similar user/query requests - the hard part

I know that collecting this data will justifiably be offensive to some - given enough data, an anonymous user may be identified and a careless user far sooner. And, as we are open, this data _will_ be closely examined, sometimes by not nice people. Some users will doubtlessly be hurt. (It is neither cold heartedness nor insensitivity that prevents me from ameliorating that statement - if we collect this data we should do it knowing the consequences.)

Given enough data I believe this approach will be both used and yield more relevant results than any other. The "used" and "yield" part of that sentence is the conversion in me wrought by the article. I now doubt a user would make the effort to use even a "semantic search" if one were available over a simple keyword search yielding good enough results with less effort on their part - sigh. Of course a semantic search would be preferentially used by "intelligent agents" - both software and some humans. But I sense neither is our target audience.

I believe user history (aka personalization) will be a component in the approach taken by the "big boys" (I am intentionally trying to communicate a negative in that phrasing as I am annoyed by the belief that it is being done quietly by those who will possess a de facto, significant, and user appreciated advantage that will be well managed to not "cause trouble.")

I do not claim that its being technically feasible, or that others are doing it, is sufficient reason for us to do it. But I do not believe in another way to deliver the most relevant results to a user (I am open to any data - especially contrary data.) One saving grace we might have, if we were to do this, would be our openness.
This will help research efforts, inform the public, and possibly influence rule makers and others We now have servers - they are being provisioned. Shall we load the data released by AOL last year and begin exploring how to use this type of data? Bill ps - I discovered the article via a blog entry by Seth Finkelstein (http://sethf.com/) I intend this as a public thank you but realize it may yield other fruit :) pps - I've become aware of an additional article bearing on this point: http://jeffnolan.com/wp/2007/05/22/google-flirts-with-evil/ From wsurowiec at gmail.com Wed Jun 6 21:03:59 2007 From: wsurowiec at gmail.com (William Surowiec) Date: Wed, 06 Jun 2007 17:03:59 -0400 Subject: [Search-l] A Google Developer Day presentation by Peter Norvig Message-ID: <4667213F.9080009@gmail.com> ... entitled "Theorizing from Data: Avoiding the Capital Mistake" http://www.youtube.com/watch?v=nU8DcBF-qo4 It is about 50 minutes long. About a third of the way through (time stamp 13:30 - 21:30) is some interesting stuff on (my words, not his) statistical learning employed in the creation of a semantic web. There is a short, but insightful comment on using queries to help "understand" the content. Closer to the end (34:00) he speaks of better results through a better understanding of the user's intent (again, my words) and hooking up the query maker with the content maker in words that express another interpretation of "a personal way." The questions are good. 
All in all, better than network tv - and no commercials (unless one considers playing with this stuff attractive):

The Google contribution to the LDC mentioned in the video:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

From codonology at gmail.com Wed Jun 6 21:31:09 2007
From: codonology at gmail.com (Hua Fang)
Date: Wed, 6 Jun 2007 17:31:09 -0400
Subject: [Search-l] To take on Google = Wikipedia + Unique "Concept Search" mechanism, Codonology
Message-ID: <4c13dd5e0706061431u62a5e0bcve14c68ea3e06a2c5@mail.gmail.com>

Hi folks:

My name is Hua Fang. I am new to the "mode" of discussion in this group, but not new in this group. I have been listening to the discussion about "founding Wikia to take over Google" kind of topics for many months. Now, I think it's about time for me to pitch in my idea, which is straightly relevant but significantly different one from what you folks have been talking about. It is called "Codonology". For now, to simply put in philosophical and technological sense, you can think it as a concept search technology. I have revealed the detail about Codonology at my web site: www.Codonology.com . I welcome the comments from everybody in this group, especially from Mr. James Wales.

Together, we can prevail.

Thanks.

Hua Fang, MD

From jeremie at jabber.org Thu Jun 7 00:58:57 2007
From: jeremie at jabber.org (jer)
Date: Wed, 6 Jun 2007 17:58:57 -0700
Subject: [Search-l] To take on Google = Wikipedia + Unique "Concept Search" mechanism, Codonology
In-Reply-To: <4c13dd5e0706061431u62a5e0bcve14c68ea3e06a2c5@mail.gmail.com>
References: <4c13dd5e0706061431u62a5e0bcve14c68ea3e06a2c5@mail.gmail.com>
Message-ID: <04C72D5F-3165-49E8-A317-F3CF3C1770CD@jabber.org>

Hua,

I've read through much of your site and I want to make sure I'm understanding it right: Codonology is the term you created for the definition of "concepts" or knowledge filaments independent of language. Is that close?

If the purpose is for search, how is this effectively different than natural language parsing, named entity recognition, and text categorization?

Thanks,

Jer

From nitin at borwankar.com Thu Jun 7 07:28:27 2007
From: nitin at borwankar.com (Nitin Borwankar)
Date: Thu, 07 Jun 2007 00:28:27 -0700
Subject: [Search-l] "directory" vs. "search engine"
In-Reply-To: <70b3cf150706031417j26ac3eefn516c114267f7dae4@mail.gmail.com>
References: <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com>
 <4660A555.7090401@gmail.com>
 <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com>
 <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com>
 <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com>
 <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com>
 <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com>
 <1609692912-1180827584-cardhu_decombobulator_blackberry.rim.net-1671855404-@bxe034.bisx.prod.on.blackberry>
 <20070603124337.GA11441@sethf.com>
 <70b3cf150706031417j26ac3eefn516c114267f7dae4@mail.gmail.com>
Message-ID: <4667B39B.30800@borwankar.com>

Working for free and working as an employee are not the only two revenue models for content creators. There are two sides to search as a business - quality of search results which brings visitors and advertising which brings revenue.

I am curious why no one is considering the option of ad revenue sharing with the content creators/editors. Why are the current models assuming that ad revenue goes to the search engine owners 100%? If the content=search quality is good enough to bring in the revenue, and the editors get a proportional share of the revenue from their pages wouldn't this get us out of the "free" vs. "employed with benefits" dichotomy?
Unless we create new economic models for all participants, aren't we merely replacing one big monolith with a few "small" monoliths, but monoliths nevertheless from a revenue point of view?

Nitin Borwankar

Jason Calacanis wrote:

>On 6/3/07, Seth Finkelstein wrote:
>
>> Ah, but I think the key difference in Mahalo vs DMOZ is
>>in a halfway reasonable monetization strategy (cream-skim the
>>top search terms). This is not *comprehensive* coverage of the Web,
>>which I think is tripping people up when it comes to discussing
>>directory vs. general search engine. Rather, it's *focus*, on the
>>specific segment likely to be most profitable. Yes, it's really
>>more of a "directory", not a "search engine". But the other side of
>>that is it's a directory which is optimized to work both "in" and "out"
>>with a search engine, and an eye towards profitability. Which is something
>>of a twist on the usual directory concept (which usually starts from a
>>taxonomy and concerns itself with breadth).
>
>That is a very astute point Seth. We are doing the top 10,000 English
>search terms, and we are using a search format/metaphor/design. We do
>have categories, and you can navigate in a DMOZ-like way, so I like to
>call Mahalo.com a "search service."
>
>>>it. Working for free as a hobby is fine (ie wikipedia), but why
>>>anyone would work for free to make venture capitalists and ceos
>>>right is beyond me.
>>
>> For the joy and happiness, the *community*, of course. I think
>>part of what Y. Benkler is analyzing in his infamous book, though not put
>>so bluntly, is this: If you have 100K to hire workers, you can put 10K
>>each towards 10 people (and after benefits and overhead, pay them around
>>5K each total), Or, put 100K towards a really good marketing flack
>>who will go around trying to convince 10 people in the entire world to
>>WORK FOR FREE, because gosh golly they're contributing to A New Era,
>>and showing those elitist priests up there that citizen-amateurs can
>>do a job without pay that's every bit as good as paid professionals.
>
>I've debated Benkler on this point, and at Wikimania in Boston this
>summer one of the speakers (before or after Benkler) made a point of
>demonstrating that folks involved in open source software projects
>participated because they thought their participation would get them
>some sort of financial return down the road (or something to that
>effect--anyone remember?).
>
>Anyway, at the end of the day everyone needs to eat, and when your
>hobby moves from being part time to full-time the rubber meets the
>road. In fact, look at Jimmy and Angela who did amazing free work on
>the Wikipedia for years and years and are now working on a for-profit
>company--Wikia--backed by the most aggressive form of capital in the
>world: venture capital.
>
>If the top folks from Wikipedia have left to "swing for the fences" at
>a venture backed company that tells you something about Benkler's
>theories now doesn't it?
>
>> But I get in trouble when I talk about that. At least in the
>>wrong place. Anyway, I may be projecting, but I think the membership
>>of this list skews more towards those on the code-developer end, and
>>interested parties keeping an eye on the project, rather than those
>>who will be doing grunt work of writing specific results.
>
>Exactly. That's the part I hate. Developers can get paid,
>administrators can get paid, but editors don't!? What if Theresa in
>your example got paid for doing good? Would that be such a bad thing?
>
>That's what we're trying to do at Mahalo: let everyone "get a taste"
>not just the management teams of venture-backed companies.
>
>[ Note: I'm not saying Wikipedia should move to a paid model, but I
>think if Wikipedia allowed OPT-IN advertising they could hire 10-25
>editors f/t to work from home and administer the system. It would be a
>start. ]
>
>best regards and Mahalo ;-) for the amazing feedback Seth... you're
>like having a free management consultant on 24-hour duty!!!
>
>Jason
>www.mahalo.com

--
Nitin Borwankar
http://walruscarpenter.wordpress.com Of shoes and ships and sealing wax of cabbages and kings
http://greener.com Find, Learn, Act .... Greener, the search engine for the planet
http://tagschema.com Implementation of tag database applications
nitin at borwankar.com 510-872-7066

From jeremie at jabber.org Thu Jun 7 17:42:11 2007
From: jeremie at jabber.org (jer)
Date: Thu, 7 Jun 2007 10:42:11 -0700
Subject: [Search-l] [relevancy of search results]
In-Reply-To: <46670415.5090709@gmail.com>
References: <46670415.5090709@gmail.com>
Message-ID: <376E8E68-07BA-4E5B-85FA-32258798F198@jabber.org>

Very well thought out Bill, and as you pointed out and anyone can plainly see, Google believes that they can make search more relevant by knowing the user better too.

I always have to fall back on the tools I'm comfortable with, so I see simple solutions to the privacy issues by using open standards. It's far more reasonable to have either local tools on your desktop or trusted 3rd parties like the Attention Trust, who intelligently compile intention "vectors" for you. These vectors can be simple common definitions and a simple format, that you can decide to include with your search queries.
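
A minimal sketch of what such an opt-in vector might look like when attached to a query. The topic labels, the weights, the `intent` field, and the JSON layout are all assumptions made for illustration; none of it reflects an Attention Trust format or any real search API.

```python
import json


def build_query(terms, intention_vector=None, share=False):
    """Build a query payload; attach the vector only when the user opts in."""
    request = {"q": terms}
    if share and intention_vector:
        # Coarsen weights to one decimal so less detail leaves the desktop
        # (a hypothetical privacy choice, not a standard).
        request["intent"] = {
            topic: round(weight, 1) for topic, weight in intention_vector.items()
        }
    return json.dumps(request, sort_keys=True)


# Compiled locally (or by a trusted third party); the labels are made up.
local_vector = {"programming": 0.72, "cooking": 0.18}

print(build_query("python", local_vector, share=True))
print(build_query("python", local_vector, share=False))
```

The point of the design is that the decision to share sits with the caller: with `share=False` the payload carries nothing but the keywords, so the engine learns no more about the user than it does today.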

Ultimately, I don't believe that it's the search engine itself that should know anything about you, it should simply support the ability to search more intelligently (beyond just keywords).

Jer

From contact at tallstreet.com Fri Jun 8 21:43:27 2007
From: contact at tallstreet.com (Tall Street)
Date: Sat, 9 Jun 2007 09:43:27 +1200
Subject: [Search-l] more than just interoperability
In-Reply-To: <8e447b720706040759n31ce078fle6244b39158fa73@mail.gmail.com>
References: <355a36af0705312336v528c3aa3v1d8d43c112f1d375@mail.gmail.com>
 <86FDA1CC-9217-4C27-BC88-E089B6EFCC43@jabber.org>
 <46631D8B.1010800@borwankar.com>
 <8e447b720706040759n31ce078fle6244b39158fa73@mail.gmail.com>
Message-ID:

You bring up an interesting idea.

Of course the first concern with any data retrieved from the client is privacy. It has to be clear exactly what is being sent back and preferably the client should authorize everything that is sent beforehand (it shouldn't passively collect information in the background and just send it otherwise there is a big risk of sending private information that the client may not wish others to know.)

Having said that why limit to social networks? How about a toolbar extension that collects your browser history and associates meta data with that (such as the search terms you used before you located the site, or the anchor text of the link you clicked to get there, time spent and number of times visited could be associated to help determine usefulness).
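
The record such an extension might keep can be sketched roughly as follows. The class name, the field names, and the scoring formula are assumptions for illustration; the post itself only says that search terms, anchor text, time spent, and visit count could be associated with each visited site.

```python
import math
from dataclasses import dataclass, field


@dataclass
class VisitRecord:
    """One locally stored entry of browsing history plus its metadata."""
    url: str
    search_terms: list = field(default_factory=list)  # query used before arriving
    anchor_text: str = ""                              # text of the clicked link
    seconds_spent: float = 0.0
    visits: int = 0


def usefulness(record):
    # Hypothetical heuristic: log-damped time spent times log-damped visits,
    # so one very long session does not swamp everything else.
    return math.log1p(record.seconds_spent) * math.log1p(record.visits)


history = [
    VisitRecord("http://example.org/a", ["open source search"], "crawler notes", 300.0, 4),
    VisitRecord("http://example.org/b", ["open source search"], "home", 5.0, 1),
]

# Most useful first, computed and kept on the user's machine.
for record in sorted(history, key=usefulness, reverse=True):
    print(record.url, round(usefulness(record), 2))
```

Nothing in this sketch leaves the client; deciding what, if anything, to send back stays a separate and explicit step.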

Such an extension would be useful in and of itself for helping people locate sites they remember visiting but whose URL or title they have forgotten, recalling only a few things about the site.

Initially collect the data and store it locally and allow the client to search it. Then add a feature that lets the client review the sites they visited and makes recommendations about what links they should send back and under which keywords.

Use something like http://www.tallstreet.com/ to rank the data (so people who have no history, or a history of sending results that are not useful, get only a tiny weighting on the ranking and people who have a history of sending back useful links get a greater weighting) and you will definitely have an interesting and more useful dataset than what you will get if you just count links to a page.

Any thoughts?

Gary

On 6/5/07, Fred Benenson wrote:
> Instead of publicly crawling the human indexes (del.icio.us / stumbleupon,
> etc.) ourselves, why don't we have our users to do it? It's not a complete
> work around, but might be an approach that works for a bit. Here's how I
> envision it:
>
> * A client side crawler (similar to yacy, but not targeting the entire web,
> just metadata rich places) implemented through a Firefox extension or
> Greasemonkey script.
> * When a client visits a social network with valuable data (as determined by
> a list managed by us) their local client makes a copy of all the data
> delivered to their client side browser.
> * The server can't tell the difference between a user surfing with the
> extension or without the extension.
> * That data is then meta-tagged and packaged properly locally and sent to
> the Wikia Search servers from the client's machine.
> * The Wikia Search servers then index and make sense of all of this data
> culled from the various clients running the Wikia Search client / extension.
>
> This way we're able to work around the bandwidth concerns that Yahoo and
> company would have with us crawling their databases. And the data that we're
> getting is merely stuff that is being browsed naturally, by live humans, so
> it's likely of more value.
>
> But bandwidth is obviously not just what they're concerned with. As has been
> mentioned, these sites view these databases of useful human tagged
> information as enormously valuable assets that give them a competitive edge.
> So then it's a question of the "intellectual property" contained in those
> databases. Now, I'm not sure if other networks do this, but Del.icio.us and
> Flickr have Creative Commons license implementation. That means that a
> particular user's stream of content that they've created (links, photos,
> etc.) can be set for people to share it. I think this would be a perfect
> opportunity for our distributed crawlers to take advantage of.
>
> Thoughts?
>
> Fred

From odp at freenet.de Sat Jun 9 17:22:36 2007
From: odp at freenet.de (Chris)
Date: Sat, 09 Jun 2007 19:22:36 +0200
Subject: [Search-l] Fwd: more than just interoperability
Message-ID: <466AE1DC.5000604@freenet.de>

Hi,

as the question of using ODP data came up above, allow me to post some links for background info on data use:

Overview article on data use in the volunteer community's newsletter:
http://dmoz.org/newsletter/2006Spring/odp_data_use.html

Selected examples for use in Science:
http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Research_Papers

Patents that quote ODP in 2006:
http://www.seobythesea.com/?p=419

The more recent research papers and patents offer some insights how a modern search engine might profit from using ODP data, beyond the obvious (i.e. using it as seeding set, regular crawling or setting up a copy).

A bit more practical stuff:

Data dump: http://rdf.dmoz.org/
License: http://dmoz.org/license.html
Various tools for processing the dump: http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/

Regards,
Chris
ODP volunteer admin
chris2001 - http://dmoz.org/profiles/chris2001.html

From wsurowiec at gmail.com Sun Jun 10 11:57:07 2007
From: wsurowiec at gmail.com (William Surowiec)
Date: Sun, 10 Jun 2007 07:57:07 -0400
Subject: [Search-l] [relevancy of search results]
In-Reply-To: <376E8E68-07BA-4E5B-85FA-32258798F198@jabber.org>
References: <46670415.5090709@gmail.com>
 <376E8E68-07BA-4E5B-85FA-32258798F198@jabber.org>
Message-ID: <466BE713.6070000@gmail.com>

Thank you Jer.

I have installed the Firefox plug in for Attention Trust (http://attentiontrust.org/) and have begun collecting local data on my browsing usage. A tool that allows a person to "own" their browsing history and share it as they see fit _is_ empowering. (Of course the devil lies in the details, but that is only a touch of show me, let me examine the evidence and not active concern.) I truly hope a mechanism for the sharing of the user's data opens for researchers and that a significant number of users opt in.

Upon reflection I realize I may have been reaching for a trumpet, in my prior post, to offer to others the ability to loudly proclaim a "clear and present danger." While that is a concern I hold, it is not cause for me to act as if it were a demonstrable fact. I know Google, as a significant example, allows me to review my search history and to opt out of their program collecting the data. I also acknowledge that if they shared the data probably more immediate, direct harm would arise through the preying actions of the "not_nice_ones."

I honestly do not view them (Google) as inimical to my interests. But I worry that they will change, especially when managers more oriented to "Wall Street concerns" become ascendant.
I know the power accruing to a few individuals in large firms is a problem endemic to our times. But being common does not mitigate the risk. We should be mindful that we are witnessing the creation of tomorrow's economic Leviathans.

(As I've grown older I appreciate more what George Washington did at the end of his presidency than what he did before. The latter was overcoming hardships when his position was weak; the former was acting for the better good of all when he could have done otherwise - something far harder.)

Bill

From sethf at sethf.com Sun Jun 10 13:53:41 2007
From: sethf at sethf.com (Seth Finkelstein)
Date: Sun, 10 Jun 2007 09:53:41 -0400
Subject: [Search-l] /Message - "Jeremie Miller and Wikia Search"
Message-ID: <20070610135341.GA12486@sethf.com>

[Disclaimer - this is Stowe Boyd's web post]

http://www.stoweboyd.com/message/2007/06/jeremie_miller_.html

"Jeremie Miller dropped me an email yesterday, letting me know he had looked for me while visiting 625 2nd Street in San Francisco. He was there meeting other folks with offices in the building. I asked what he was up to, and he responded that he is working with Jimmy Wales and the folks at Wikia on a new open source search platform:"

[press release snipped]

"I plan to ask Jeremie to join me in an upcoming episode of /Talkshow to tell us all more about it."

[SF: 625 2nd Street is the LookSmart building,
http://startup.wsj.com/ecommerce/ecommerce/20061214-tam.html
"Just as in the dot-com days, most of the start-ups at 625 Second St. are related to the Internet. Many are unprofitable so far but say they expect to make money from online advertising, just as search giant Google Inc. has done."
-- Seth Finkelstein Consulting Programmer http://sethf.com/ Infothought blog - http://sethf.com/infothought/blog/ Interview: http://sethf.com/essays/major/greplaw-interview.php From wsurowiec at gmail.com Sun Jun 10 14:02:40 2007 From: wsurowiec at gmail.com (William Surowiec) Date: Sun, 10 Jun 2007 10:02:40 -0400 Subject: [Search-l] the notes of Stewart Brand on a recent long_now talk Message-ID: <466C0480.4060901@gmail.com> Stewart Brand, writing notes on a recent Long Now talk by Paul Hawken, "The New Great Transformation," uses phrases that I believe apply elsewhere. Hawken became aware of the number of environmental groups, and his thoughts about it led to a book, his talk, and Brand's notes. An extended quote from Stewart Brand: *** start quoted text **** ... he estimates there are over 1,000,000 such organizations in the world, adding up to the largest and fastest growing Movement in history. The phenomenon has been overlooked because it lacks the customary hallmarks of a movement--- no charismatic leaders, no grand theory or ideology, no "ism," no defining events. The new activist groups are about dispersing power rather than aggregating power. Their focus is on ideas rather than ideology--- ideologies are clung to, but ideas can be tried and tossed or improved. The point is to solve problems, usually from the bottom up. The movement can never be divided because it is already atomized. What's going on? Hawken wondered if humanity might have some collective intelligence that we don't yet understand. The metaphor he finds most useful is the immune system, which is the most complex system in our body--- more complex than the entire Internet--- massive, distributed, subtle, ingenious, and effective. The opposite of a hierarchical army, its power is in the density of its network. It deals with problems not through frontal attack but complex negotiation and rapprochement. ... 
*** end quoted text *** Perhaps even cynics might pause to wonder if collective intelligences might be emerging - and recognize their role as a necessity to the healthy functioning of the whole. The Long Now Foundation - http://www.longnow.org Seminars & downloads: http://www.longnow.org/projects/seminars/ database created by Paul Hawken: WiserEarth.org Bill From sami2065 at gmail.com Mon Jun 11 00:03:57 2007 From: sami2065 at gmail.com (Sami M) Date: Sun, 10 Jun 2007 17:03:57 -0700 Subject: [Search-l] call to action.... Message-ID: Hi Folks, I've been following this list with interest. Internet search is my passion and I've been following it since 1998, starting with some projects in grad school. I've been working on a prototype of a large scale search engine for the last 2 years. This has been done mostly moonlighting nights and weekends except for the last three months when I left my day job to work on it full time. In summary? I've built a working implementation of the original Google prototype according to their Stanford paper, all pushing approximately 60K lines of mostly C code. It is scalable on a cluster of commodity Linux boxes with some additional work. A single server in this case can crawl, index, and serve 50M documents. It is turning out to be a big task and now I am looking out for options on what direction to take next. This is a call to action. I am open to any suggestions or feedback. If anyone is interested in joining hands, collaborating, or investing in any sort of way I'd be interested in talking about it. I am based in the San Francisco Bay Area. Cheers.. Sami sami2065 at gmail.com Re: The Anatomy of a Large-Scale Hypertextual Web Search Engine (http://infolab.stanford.edu/~backrub/google.html) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070610/502ce544/attachment.html From aerik at thesylvans.com Mon Jun 11 00:35:23 2007 From: aerik at thesylvans.com (Aerik Sylvan) Date: Sun, 10 Jun 2007 17:35:23 -0700 Subject: [Search-l] call to action.... In-Reply-To: References: Message-ID: <355a36af0706101735t12d78259wa4dac3e76f981be1@mail.gmail.com> So... are you looking to join up with the Wikia project? Or is this just kind of a fishing email to see what happens (I'm not criticizing, just clarifying). I, for one, think writing a search app in C probably makes a heckuva lotta sense, if it's really going to scale. However, if your project is based on the original paper, are you infringing on any (ugh) patents? Best Regards, Aerik On 6/10/07, Sami M wrote: > > > > Hi Folks, > > > > I've been following this list with interest. Internet search is my passion > and I'd been following it since 1998 starting with some projects in grad > school. > > > > I'd been working on a prototype of a large scale search engine the last 2 > years. This has been done mostly moonlighting nights and weekends except for > the last three months when I left my day job to work on it fulltime. In > summary? I've built a working implementation of the original Google > prototype according to their Stanford paper all pushing approximately 60K > lines of mostly C code. It is scalable on a cluster of commodity linux boxes > with some additional work. A single server in this case can crawl, index, > and serve 50M documents. > > > > It is turning out to be a big task and now I am looking out for options on > what direction to take next. This is a call to action. I am open to any > suggestions or feedback. If anyone is interested in joining hands, > collaborating, or investing in any sort of way I'd be interested in talking > about it. I am based in San Francisco bayarea. > > > > Cheers.. 
> > > > Sami > > sami2065 at gmail.com > > > Re: The Anatomy of a Large-Scale Hypertextual Web Search Engine > (http://infolab.stanford.edu/~backrub/google.html) > _______________________________________________ > Search-l mailing list > Search-l at wikia.com > http://lists.wikia.com/mailman/listinfo/search-l > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > From jeremie at jabber.org Mon Jun 11 04:57:46 2007 From: jeremie at jabber.org (jer) Date: Sun, 10 Jun 2007 23:57:46 -0500 Subject: [Search-l] call to action.... In-Reply-To: References: Message-ID: Sami, are you looking for help, à la open source style? I know I'd definitely be interested if so :) If it is open source, you're welcome to sign up for the lab servers for (additional?) dev/testing or any project hosting, etc. Jer On Jun 10, 2007, at 7:03 PM, Sami M wrote: > > Hi Folks, > > I've been following this list with interest. Internet search is my > passion and I'd been following it since 1998 starting with some > projects in grad school. > > I'd been working on a prototype of a large scale search engine the > last 2 years. This has been done mostly moonlighting nights and > weekends except for the last three months when I left my day job to > work on it fulltime. In summary? I've built a working > implementation of the original Google prototype according to their > Stanford paper all pushing approximately 60K lines of mostly C > code. It is scalable on a cluster of commodity linux boxes with > some additional work. A single server in this case can crawl, > index, and serve 50M documents. > > It is turning out to be a big task and now I am looking out for > options on what direction to take next. This is a call to action. I > am open to any suggestions or feedback. 
If anyone is interested in > joining hands, collaborating, or investing in any sort of way I'd > be interested in talking about it. I am based in San Francisco > bayarea. > > Cheers.. > > Sami > > sami2065 at gmail.com > > > Re: The Anatomy of a Large-Scale Hypertextual Web Search Engine > (http:// infolab.stanford.edu/~backrub/google.html ) > _______________________________________________ > Search-l mailing list > Search-l at wikia.com > http://lists.wikia.com/mailman/listinfo/search-l > Change options or unsubscribe: http://lists.wikia.com/mailman/ > options/search-l -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070610/2d421eba/attachment.html From nitin at borwankar.com Mon Jun 11 06:38:35 2007 From: nitin at borwankar.com (Nitin Borwankar) Date: Sun, 10 Jun 2007 23:38:35 -0700 Subject: [Search-l] call to action.... In-Reply-To: References: Message-ID: <466CEDEB.4080208@borwankar.com> Sami M wrote: > > > Hi Folks, > > > > I've been following this list with interest. Internet search is my > passion and I'd been following it since 1998 starting with some > projects in grad school. > > > > I'd been working on a prototype of a large scale search engine the > last 2 years. This has been done mostly moonlighting nights and > weekends except for the last three months when I left my day job to > work on it fulltime. In summary? I've built a working implementation > of the original Google prototype according to their Stanford paper all > pushing approximately 60K lines of mostly C code. It is scalable on a > cluster of commodity linux boxes with some additional work. A single > server in this case can crawl, index, and serve 50M documents. > > > > It is turning out to be a big task and now I am looking out for > options on what direction to take next. This is a call to action. I am > open to any suggestions or feedback. 
If anyone is interested in > joining hands, collaborating, or investing in any sort of way I'd be > interested in talking about it. I am based in San Francisco bayarea. > > > > Cheers.. > > > > Sami > Hi Sami, You may want to decide how you want to license the source - you need to decide this and publicize it before inviting participants so everyone is clear what happens to their code contributions etc. Just a thought. Nitin -- Nitin Borwankar http://walruscarpenter.wordpress.com Of shoes and ships and sealing wax of cabbages and kings http://greener.com Find, Learn, Act .... Greener, the search engine for the planet http://tagschema.com Implementation of tag database applications nitin at borwankar.com 510-872-7066 From peter.burden at gmail.com Mon Jun 11 15:04:17 2007 From: peter.burden at gmail.com (peter burden) Date: Mon, 11 Jun 2007 16:04:17 +0100 Subject: [Search-l] call to action.... In-Reply-To: References: Message-ID: <466D6471.5020805@gmail.com> Sami M wrote: > > Hi Folks, > > I've been following this list with interest. Internet search is my > passion and I'd been following it since 1998 starting with some > projects in grad school. > > I'd been working on a prototype of a large scale search engine the > last 2 years. This has been done mostly moonlighting nights and > weekends except for the last three months when I left my day job to > work on it fulltime. In summary? I've built a working implementation > of the original Google prototype according to their Stanford paper all > pushing approximately 60K lines of mostly C code. It is scalable on a > cluster of commodity linux boxes with some additional work. A single > server in this case can crawl, index, and serve 50M documents. > That's most interesting but I'd echo concerns about whether this would violate anybody's IPR, software patents etc. 
A quick check reveals 50 Google patents. See http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=47&f=G&l=50&co1=AND&d=ptxt&s1=Google.ASNM.&OS=AN/Google&RS=AN/Google (comments about the length of the URL should be directed to the US Patent Office) I haven't yet fully analysed this and the diagrams seem to use an image format my browser doesn't know how to handle. About 20% of the patents look as if they may be relevant to an SE - some are clearly to do with extra bells and whistles on page ranking - others are more interesting, for example 6,658,423 (December 2003) relates to detection of near-identical duplicates and looks very similar to the techniques I use for the same purpose in my crawler (Grr!). I'm not sure what other expressions of Google IPR we might transgress, although they may have left some of their best ideas unpatented in the interests of secrecy. Incidentally, if we do build on Sami's software I can offer a crawler that will do 50 pages/sec using two very modest domestic PCs (one crawling/parsing and one saving metadata in a MySQL database). It's written in C and is multi-threaded. > It is turning out to be a big task and now I am looking out for > options on what direction to take next. This is a call to action. I am > open to any suggestions or feedback. If anyone is interested in > joining hands, collaborating, or investing in any sort of way I'd be > interested in talking about it. I am based in San Francisco bayarea. > I'm about 6,000 miles from San Francisco so can't really offer much directly. > Cheers.. 
> > Sami > > sami2065 at gmail.com > > > Re: The Anatomy of a Large-Scale Hypertextual Web Search Engine > (http://infolab.stanford.edu/~backrub/google.html) > > ------------------------------------------------------------------------ > > _______________________________________________ > Search-l mailing list > Search-l at wikia.com > http://lists.wikia.com/mailman/listinfo/search-l > Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l From fred.benenson at gmail.com Mon Jun 11 15:35:06 2007 From: fred.benenson at gmail.com (Fred Benenson) Date: Mon, 11 Jun 2007 11:35:06 -0400 Subject: [Search-l] call to action.... In-Reply-To: <466D6471.5020805@gmail.com> References: <466D6471.5020805@gmail.com> Message-ID: <8e447b720706110835t28c24b1p8d2a78951bf9b5a1@mail.gmail.com> I'm no lawyer, but it is my understanding that "the" Google patents are actually owned by Stanford. This is because Page & Brin were graduate students working on their PhDs when they wrote that paper, and Stanford's policy is to own the work that gets grants / funding. This means every time Google wants to license most of the SE / PageRank stuff out, say to IBM looking to index their internal network of documents or sites, Stanford gets a slice of the pie. There's also the possibility that Google has essentially obsoleted that original technology, and if the patent is specific enough, Google's current workings no longer have any ties to that paper, so we would only be getting into trouble with Stanford. Fred B On 6/11/07, peter burden wrote: > > Sami M wrote: > > > > Hi Folks, > > > > I've been following this list with interest. Internet search is my > > passion and I'd been following it since 1998 starting with some > > projects in grad school. > > > > I'd been working on a prototype of a large scale search engine the > > last 2 years. 
This has been done mostly moonlighting nights and > > weekends except for the last three months when I left my day job to > > work on it fulltime. In summary? I've built a working implementation > > of the original Google prototype according to their Stanford paper all > > pushing approximately 60K lines of mostly C code. It is scalable on a > > cluster of commodity linux boxes with some additional work. A single > > server in this case can crawl, index, and serve 50M documents. > > > > That's most interesting but I'd echo concerns about whether this would > violate anybody's IPR, software > patents etc. A quick check reveals 50 Google patents > > See > > > http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=47&f=G&l=50&co1=AND&d=ptxt&s1=Google.ASNM.&OS=AN/Google&RS=AN/Google > > (comments about the length of the URL should be directed to the US > Patent Office) > > I haven't yet fully analysed this and the diagrams seem to use an image > format > my browser doesn't know how to handle. About 20% of the patents look as if > they may be relevant to an SE - some are clearly to do with extra bells > and whistles > on page ranking - others are more interesting, for example 6,658,423 > (december > 2003) relates to detection of near-identical duplicates and looks very > similar to > the techniques I use for the same purpose in my crawler (Grr!) > > I'm not sure what other expressions of Google IPR we might transgress > although, > being secretive, they may not have patented some of their best ideas in > the interests > of secrecy. > > Incidentally if we do build on Sami's software I can offer a crawler > that will do > 50 pages/sec using two very modest domestic PCs (one crawling/parsing and > one saving metadata in a MySQL database). It's written in C and is > multi-threaded. > > > It is turning out to be a big task and now I am looking out for > > options on what direction to take next. This is a call to action. 
I am > > open to any suggestions or feedback. If anyone is interested in > > joining hands, collaborating, or investing in any sort of way I'd be > > interested in talking about it. I am based in San Francisco bayarea. > > > I'm about 6,000 miles from San Francisco so can't really offer much > directly. > > > Cheers.. > > > > Sami > > > > sami2065 at gmail.com > > > > > > Re: The Anatomy of a Large-Scale Hypertextual Web Search Engine > > (http:// infolab.stanford.edu/~backrub/google.html > > ) > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Search-l mailing list > > Search-l at wikia.com > > http://lists.wikia.com/mailman/listinfo/search-l > > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > > > _______________________________________________ > Search-l mailing list > Search-l at wikia.com > http://lists.wikia.com/mailman/listinfo/search-l > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070611/48d5ac7e/attachment.html From patrick at foxmarks.com Mon Jun 11 16:40:54 2007 From: patrick at foxmarks.com (Patrick Corcoran) Date: Mon, 11 Jun 2007 09:40:54 -0700 Subject: [Search-l] call to action.... In-Reply-To: References: Message-ID: <466D7B16.2000904@foxmarks.com> Sami, This sounds like a huge engineering accomplishment. Clearly you have put a lot of thought and effort into this. Kudos! I am curious though: what is it you are trying to build? Is there a deficiency in how Google ranks its results that you are trying to exploit? Is there a new strategy or algorithm to your search? How is it different from what has already been done quite a few times since the early 90's? (I hope these questions do not sound critical -- they are not intended to be. 
I'm just more intrigued at the moment by the question "why?" than the questions "how?" or "what?") Also, is there any place online where we could have a preview of its current state? regards, Patrick Corcoran Sami M wrote: > > > > Hi Folks, > > > > I've been following this list with interest. Internet search is my > passion and I'd been following it since 1998 starting with some > projects in grad school. > > > > I'd been working on a prototype of a large scale search engine the > last 2 years. This has been done mostly moonlighting nights and > weekends except for the last three months when I left my day job to > work on it fulltime. In summary? I've built a working implementation > of the original Google prototype according to their Stanford paper all > pushing approximately 60K lines of mostly C code. It is scalable on a > cluster of commodity linux boxes with some additional work. A single > server in this case can crawl, index, and serve 50M documents. > > > > It is turning out to be a big task and now I am looking out for > options on what direction to take next. This is a call to action. I am > open to any suggestions or feedback. If anyone is interested in > joining hands, collaborating, or investing in any sort of way I'd be > interested in talking about it. I am based in San Francisco bayarea. > > > > Cheers.. > > > > Sami > > > sami2065 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070611/f0d6fb44/attachment.html From jwales at wikia.com Mon Jun 11 07:02:57 2007 From: jwales at wikia.com (Jimmy Wales) Date: Mon, 11 Jun 2007 09:02:57 +0200 Subject: [Search-l] "directory" vs. 
"search engine" In-Reply-To: <4667B39B.30800@borwankar.com> References: <355a36af0706010859u34e599ecs469c26393865a07@mail.gmail.com> <4660A555.7090401@gmail.com> <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com> <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com> <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com> <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com> <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com> <1609692912-1180827584-cardhu_decombobulator_blackberry.rim.net-1671855404-@bxe034.bisx.prod.on.blackberry> <20070603124337.GA11441@sethf.com> <70b3cf150706031417j26ac3eefn516c114267f7dae4@mail.gmail.com> <4667B39B.30800@borwankar.com> Message-ID: On Jun 7, 2007, at 9:28 AM, Nitin Borwankar wrote: > Working for free and working as an employee are not the only two > revenue > models for content creators. Absolutely right! And I think this contrast between "working for free" and "working as an employee" just entirely misses the point. I don't do anything for free, and I don't do anything for pay. I just do what I think is fun, and sometimes I get paid and sometimes I don't. The approach to open source software and free culture which gets hung up on this "working for free" question always leaves people baffled, because there is something fundamentally wrong with looking at it that way. Consider bowling. Some people get paid a lot of money to bowl. Suppose someone approached the bowling alley business with the belief that the objective of the business is to somehow persuade people to do "bowling work" for free. And then someone else comes along to compete, using the argument that since some people get paid a lot of money in some contexts to bowl, that a better business model is to pay people to bowl. This is clearly silly for bowling, because we have long long experience in knowing that bowling for fun and bowling for money can and should co-exist in the world. 
No one would think of going into a bowling alley to interview the customers with questions like "Why do you do this for free, when professional bowlers are paid to do it? Don't you feel exploited?" And the operators of successful bowling alleys know their true business: providing a place for people to do something they enjoy. They notice that bowling is both competitive and social, so they work to set up leagues of similarly skilled teams. They notice that bowlers get hungry, so they sell food. They notice that beer goes well with socializing and eating and playing a game, so they sell beer. Some people get paid to write. Others do it for free. Others are willing to pay to have an environment where the writing is supported in certain ways. From sethf at sethf.com Mon Jun 11 18:59:46 2007 From: sethf at sethf.com (Seth Finkelstein) Date: Mon, 11 Jun 2007 14:59:46 -0400 Subject: [Search-l] "directory" vs. "search engine" In-Reply-To: References: <70b3cf150706021406x33ee2024id44552ce2303fedb@mail.gmail.com> <355a36af0706021513i60b04785mf8d89af02c0a55ad@mail.gmail.com> <355a36af0706021513p6a818a4euf09f00ef2a9800e0@mail.gmail.com> <70b3cf150706021537l6b099f1dt64b8c97a00cda6a2@mail.gmail.com> <355a36af0706021627x2fe804dmbfbf110a7d1fb5d3@mail.gmail.com> <1609692912-1180827584-cardhu_decombobulator_blackberry.rim.net-1671855404-@bxe034.bisx.prod.on.blackberry> <20070603124337.GA11441@sethf.com> <70b3cf150706031417j26ac3eefn516c114267f7dae4@mail.gmail.com> <4667B39B.30800@borwankar.com> Message-ID: <20070611185946.GA18476@sethf.com> On Mon, Jun 11, 2007 at 09:02:57AM +0200, Jimmy Wales wrote: > And I think this contrast between "working for free" and "working as > an employee" just entirely misses the point. I don't do anything > for free, and I don't do anything for pay. I just do what I think > is fun, and sometimes I get paid and sometimes I don't. http://www.snopes.com/glurge/glurge.asp "What is glurge? 
Think of it as chicken soup with several cups of sugar mixed in: It's supposed to be a method of delivering a remedy for what ails you by adding sweetening to make the cure more appealing, but the result is more often a sickly-sweet concoction that induces hyperglycemic fits." > The approach to open source software and free culture which gets > hung up on this "working for free" question always leaves people > baffled, because there is something fundamentally wrong with looking > at it that way. I think some of the problems of analyzing open source come from too much of an emphasis on the business model of artificial scarcity applied to software. Of course, it's a common business model. But it's not the only workable business model. And it shouldn't be heretical to conclude it doesn't work well in many contexts. That's often a rather difficult point, since in itself it's tied in with many other topics, such as the artificial scarcity model applied to music or books (not the same thing), or elsewhere, the very understandable desire of business owners not to pay workers. Free culture, however, is an entirely different matter. Open-source contributions can be very economically rational for programmers in ways culture contributions really aren't - e.g. the service market for programmers is far, fa