From jpbedell at gmail.com Mon Jan 1 21:48:24 2007
From: jpbedell at gmail.com (J. Patrick Bedell)
Date: Mon Jan 1 22:06:42 2007
Subject: [Search-l] search, website "insurance" through wiki financial markets
Message-ID: <9918cabb0701011348g13826098k6a30d94a9800868c@mail.gmail.com>

Hello search-l@wikia.com people,

This is another mailing list posting that is better implemented as code (thanks for mediawiki.sf.net!), but nonetheless: one thing that wiki(a) could bring to search is the ability for wiki users to easily and automatically add primary references (such as from standard text, audio, and video formats) and use financial mechanisms to weigh them.

Would you buy information currency associated with a contribution to wikipedia, and follow that demand to create a programmatic trader that proactively responds to market signals associated with knowledge import and management? In an actuarial sense, the fact that a certain addition is unreverted x seconds after being committed to an authoritative wiki means that it potentially has some economic value as wikitruth.

FWIW, the financial mechanisms for search insurance, website insurance, and information insurance generally are being addressed at http://insurance.wikia.com ... very slowly. :)

If you would like to do things like structure a money-back guarantee for choosing a search provider, please join the wikia community at insurance.wikia.com as we develop search insurance and other information insurance products from theory to reality.

Thanks for your work!

--
J. Patrick Bedell
jpbedell@gmail.com
http://infoeng.sourceforge.net
http://rothbardix.blogspot.com

From thomasasta at gmx.net Tue Jan 2 00:00:51 2007
From: thomasasta at gmx.net (thomasasta@gmx.net)
Date: Tue Jan 2 00:00:54 2007
Subject: [Search-l] database exchange of 2 nutches (hybridity of nutch with yacy)
Message-ID: <20070102000051.21080@gmx.net>

Hi,

there are quite interesting projects out there: http://search.wikia.com/wiki/Search_Wikia
I want to suggest another one here.
Nutch is used either to index specific pages for specific customers, or as an open source engine for the whole World Wide Web. *Two* Nutch engines indexing the web make no sense. It would be useful if all Nutch installations that index the web could be connected together and perform a database exchange.

Well, you all know www.yacy.net - the p2p search engine. I do not want to suggest the same for Nutch, but some interoperability between two Nutch nodes. Is it possible to add / import the indexed database of Nutch A into Nutch B? Today this import must be done manually, but why not within a network? If we have 5 Nutch engines in the world indexing the web (I am not talking about customer solutions for partial intranet webs), why not accumulate their indexes?

I want to suggest a structure which is a hybrid with yacy.net. Would it be possible to define a database structure which is usable for yacy as well? Then the Nutch index could also be spread to yacy nodes and get a backup there, and other Nutches could then add the media indexed by yacy to their database. So yacy p2p is the way to exchange and back up the databases of several Nutches, and a Nutch can back up and exchange with yacy nodes and with other Nutch engines. I therefore think any Nutch should run a yacy node as well, and the database must be made interoperable. Would this be possible?

Well, you know the emule-project.net filesharing structure. Or take gnutella with its ultrapeers. The emule servers support collecting urls/hashes, and emule also has a p2p node system called kademlia. Would such a p2p engine structure be possible, if yacy is the p2p node and Nutch the ultrapeer, indexing on its own but also backing up its database to the p2p yacy network and getting redundant urls from the network?

See then the wiki-search project at the link above. As urls get a human ranking (the page is ranked right after it was viewed with the yacy bar), the Nutch database could also receive these human-ranked urls through the database exchange.

Any idea whether a common database structure is possible, and whether Nutch could implement a yacy node to hold connections to yacy's DHT network, so that Nutch could (also) be a yacy node? As both are Java, this should work?

You are welcome to subscribe to the yacy.net forums as well, to play around with this node and toolbar and the already implemented (but still to be developed further) human ranking.

Thanks for collaboration ideas.

tom

From toufeeqh at gmail.com Tue Jan 2 01:50:39 2007
From: toufeeqh at gmail.com (Toufeeq Hussain)
Date: Tue Jan 2 01:50:41 2007
Subject: [Search-l] database exchange of 2 nutches (hybridity of nutch with yacy)
In-Reply-To: <20070102000051.21080@gmx.net>
References: <20070102000051.21080@gmx.net>
Message-ID:

Hi,

On 1/2/07, thomasasta@gmx.net wrote:
> *Two* Nutch engines indexing the web make no sense.
> It would be useful, if all Nutch - indexing the web - can be connected together and perform a database exchange.
>
> Well you all know www.yacy.net - the p2p search engine - I do not want to suggest for nutch the same, but some interoperability of two nutch nodes.

I guess Lucene-Hadoop has pretty much the same feature set that you are looking for.

http://wiki.apache.org/lucene-hadoop/
http://wiki.apache.org/nutch/NutchHadoopTutorial

-Toufeeq
--
blog @ http://toufeeq.net

From wsurowiec at gmail.com Tue Jan 2 02:37:58 2007
From: wsurowiec at gmail.com (William Surowiec)
Date: Tue Jan 2 02:45:09 2007
Subject: [Search-l] Welcome, and let's keep the volume low :)
In-Reply-To: <4595192E.2000807@wikia.com>
References: <4595192E.2000807@wikia.com>
Message-ID: <4599C586.90707@gmail.com>

An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070102/8c4154d2/attachment.html

From Keith at Botley.net Tue Jan 2 04:37:53 2007
From: Keith at Botley.net (Keith Botley)
Date: Tue Jan 2 04:52:59 2007
Subject: [Search-l] Requirements?
Message-ID: <001801c72e27$c4e28410$b900a8c0@Botley.local>

Bill Surowiec made a fine post to the mailing list and ended it with this comment.

"I believe that determining and articulating the target in enough detail to be actionable is the first order of business."

I agree with his prioritization: the requirements should be articulated clearly for all. I have seen numerous posts to the mailing list concerning architecture, which is valuable and informative but possibly not the current focus.

Keith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070102/97f582b6/attachment.html

From m0mms at t-online.de Tue Jan 2 10:16:00 2007
From: m0mms at t-online.de (m0mms@t-online.de)
Date: Tue Jan 2 12:39:11 2007
Subject: [Search-l] requirements
Message-ID: <1H1gga-2GyaZM0@fwd34.aul.t-online.de>

An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070102/833ac944/attachment-0001.html

From Loewe at bizdev.ch Tue Jan 2 14:57:58 2007
From: Loewe at bizdev.ch (Oskar G. Loewe - Focus Switzerland)
Date: Tue Jan 2 15:22:45 2007
Subject: [Search-l] AW: Search-l Digest, Vol 2, Issue 1
In-Reply-To: <20070102123912.AEB5DB985F4@shannon.tpa.wikia-inc.com>
Message-ID: <20070102145757.D15F7C0338@p15124325.pureserver.info>

Pls. unsubscribe

Kind regards

Oskar G. Loewe
FOCUS SWITZERLAND AG
Business Development & Consulting
Alpenstrasse 1
CH-6004 Luzern
Tel. 0041 41 850 51 41
Fax 0041 41 850 51 60
www.bizdev.ch

-----Original Message-----
From: search-l-bounces@wikia.com [mailto:search-l-bounces@wikia.com] On behalf of search-l-request@wikia.com
Sent: Tuesday, 2 January 2007 13:39
To: search-l@wikia.com
Subject: Search-l Digest, Vol 2, Issue 1

Send Search-l mailing list submissions to
    search-l@wikia.com

To subscribe or unsubscribe via the World Wide Web, visit
    http://lists.wikia.com/mailman/listinfo/search-l
or, via email, send a message with subject or body 'help' to
    search-l-request@wikia.com

You can reach the person managing the list at
    search-l-owner@wikia.com

When replying, please edit your Subject line so it is more specific than "Re: Contents of Search-l digest..."

Today's Topics:

   1. Danny Sullivan / Search Engine Land Q&A article (Seth Finkelstein)
   2. MINERVA: Why a decentral search engine is better than a central one (thomasasta@gmx.net)
   3. search, website "insurance" through wiki financial markets (J. Patrick Bedell)
   4. database exchange of 2 nutches (hybridity of nutch with yacy) (thomasasta@gmx.net)
   5. Re: database exchange of 2 nutches (hybridity of nutch with yacy) (Toufeeq Hussain)
   6. Re: Welcome, and let's keep the volume low :) (William Surowiec)
   7. Requirements? (Keith Botley)
   8. requirements (m0mms@t-online.de)

----------------------------------------------------------------------

Message: 1
Date: Sat, 30 Dec 2006 23:45:44 -0500
From: Seth Finkelstein
Subject: [Search-l] Danny Sullivan / Search Engine Land Q&A article
To: search-l@wikia.com
Message-ID: <20061231044544.GA12217@sethf.com>
Content-Type: text/plain; charset=us-ascii

There's of course been a lot of news coverage. But I'd like to recommend the following article by search engine expert Danny Sullivan:

http://searchengineland.com/061229-193718.php

This is by no means the typical media puff piece. It's worth reading especially for the references from Danny regarding the history of previous open-source search efforts, and the challenges they've faced.
--
Seth Finkelstein  Consulting Programmer  http://sethf.com
Infothought blog - http://sethf.com/infothought/blog/
Interview: http://sethf.com/essays/major/greplaw-interview.php

------------------------------

Message: 2
Date: Sun, 31 Dec 2006 12:59:18 +0100
From: thomasasta@gmx.net
Subject: [Search-l] MINERVA: Why a decentral search engine is better than a central one
To: search-l@wikia.com
Message-ID: <20061231115918.20060@gmx.net>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

next to the yacy.net p2p search engine there is another one, called MINERVA, which was presented at a Google Research Congress. Here is a one-hour video on why a p2p search engine is better than a central one. A third p2p search engine is GALANX; see the links below for it and for 5 others, so 8 p2p search engines exist worldwide.

Jim, the simple question is whether the wikisari code will be based on minerva, nutch or yacy?

Let's start the year with good news! ;-)

Video
http://video.google.com/videoplay?docid=8710122769175704670
http://www.ojobuscador.com/2006/08/11/p2p-web-search-en-google/

Website:
http://www.minerva-project.org/index.html
http://www.mpi-sb.mpg.de/departments/d5/software/minerva/index.html

Presentation
http://www.mpi-inf.mpg.de/departments/d5/software/minerva/publications.html

Further Links
http://www.mpi-inf.mpg.de/departments/d5/software/minerva/publications/VLDB_Demo.pdf
http://www.cs.wisc.edu/~yuanwang/doc/yuanwang_cv.pdf
http://www.mpi-inf.mpg.de/departments/d5/teaching/ws04_05/Proseminar/2005-01-18_Paper2.ppt
http://pdos.csail.mit.edu/chord/
http://www.vldb2005.org/program/paper/demo/p1263-bender.pdf
http://middleware05.objectweb.org/WSProceedings/demos/d9_Bender.pdf

Galanx / Chora / CoOpeer / Widesource / Youserve / Odissea
http://www.mpi-sb.mpg.de/~smichel/files/p2pirmichel.ppt
http://iptps06.cs.ucsb.edu/talks/Zimmer.ppt
http://www.cs.berkeley.edu/~grant/papers/Chora.pdf
http://security.riit.tsinghua.edu.cn/share/coopeer.pdf
http://www.filetransit.com/view.php?id=6780
http://www.deepwebresearch.info/
http://www.mpi-inf.mpg.de/departments/d5/teaching/ws03_04/p2p-data/01-13-writeup2.pdf

Advantages of Pure P2P Searching

There are some advantages of peer searching over centralized indexing and search engines.

- Distributed processing -- no need for huge server farms and enormous indexes
- Freshness of the information -- peer searching is always current and doesn't get stale, unlike robot-generated indexes or human-generated directories
- Modular -- no dependencies on any specific server
- Ease of sharing -- does not require a publication step (create a web page or upload to a server) to share information
- Anonymity -- yacy, in particular, is designed to obscure the requester's identity
- File-format agnostic -- not limited to HTML or other text files, any file can be shared and found by name
- Local control and flexibility -- can be implemented with security permissions and data structures
- Open search standard compatible

------------------------------

Message: 3
Date: Mon, 1 Jan 2007 13:48:24 -0800
From: "J. Patrick Bedell"
Subject: [Search-l] search, website "insurance" through wiki financial markets
To: search-l@wikia.com
Cc: jpbedell@mises.com
Message-ID: <9918cabb0701011348g13826098k6a30d94a9800868c@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

------------------------------

Message: 4
Date: Tue, 02 Jan 2007 01:00:51 +0100
From: thomasasta@gmx.net
Subject: [Search-l] database exchange of 2 nutches (hybridity of nutch with yacy)
To: nutch-dev@lucene.apache.org, search-l@wikia.com
Message-ID: <20070102000051.21080@gmx.net>
Content-Type: text/plain; charset="iso-8859-1"

------------------------------

Message: 5
Date: Tue, 2 Jan 2007 07:20:39 +0530
From: "Toufeeq Hussain"
Subject: Re: [Search-l] database exchange of 2 nutches (hybridity of nutch with yacy)
To: "thomasasta@gmx.net"
Cc: search-l@wikia.com, nutch-dev@lucene.apache.org
Message-ID:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

------------------------------

Message: 6
Date: Mon, 01 Jan 2007 21:37:58 -0500
From: William Surowiec
Subject: Re: [Search-l] Welcome, and let's keep the volume low :)
To: search-l@wikia.com
Message-ID: <4599C586.90707@gmail.com>
Content-Type: text/plain; charset="us-ascii"

An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070102/8c4154d2/attachment-0001.html

------------------------------

Message: 7
Date: Mon, 1 Jan 2007 23:37:53 -0500
From: "Keith Botley"
Subject: [Search-l] Requirements?
To:
Message-ID: <001801c72e27$c4e28410$b900a8c0@Botley.local>
Content-Type: text/plain; charset="us-ascii"

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070102/97f582b6/attachment-0001.html

------------------------------

Message: 8
Date: Tue, 02 Jan 2007 11:16:00 +0100
From: "m0mms@t-online.de"
Subject: [Search-l] requirements
To: search-l@wikia.com
Message-ID: <1H1gga-2GyaZM0@fwd34.aul.t-online.de>
Content-Type: text/plain; charset="us-ascii"

An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070102/833ac944/attachment.html

------------------------------

_______________________________________________
Search-l mailing list
Search-l@wikia.com
http://lists.wikia.com/mailman/listinfo/search-l

End of Search-l Digest, Vol 2, Issue 1
**************************************

From peter.burden at gmail.com Tue Jan 2 17:12:30 2007
From: peter.burden at gmail.com (Peter Burden)
Date: Tue Jan 2 17:12:32 2007
Subject: [Search-l] Requirements?
In-Reply-To: <001801c72e27$c4e28410$b900a8c0@Botley.local>
References: <001801c72e27$c4e28410$b900a8c0@Botley.local>
Message-ID:

On 02/01/07, Keith Botley wrote:
>
> Bill Surowiec made a fine post to the mailing list and ended it with this
> comment.
>
> "I believe that determining and articulating the target in enough detail
> to be actionable is the first order of business."
>
> I agree with his prioritization of business as the requirements being
> articulated clearly for all. I have seen numerous posts to the mailing list
> concerning architecture which is valuable and informative but possibly not
> the current focus.
>

Or, perhaps, to put it another way. What's it going to do that existing engines don't do? Or in business speak what's the unique selling proposition?
The underlying technology or the way the software development team is organised is not important from this point of view.

I can only speak for myself - but here are some of the features I'd like from a search engine. No existing engine, AFAIK, does all of these.

1. User tweakable ranking. I.e. I can choose the parameters that control the ordering of results to meet my particular current whims and fancies. [BTW I have some serious doubts about the usefulness of page ranking.]

2. Semantic searching. I can search for "pages" that are relevant to a topic by describing the topic rather than having to think of likely combinations of keywords. [But keep the AI research community and the ontologists at arm's length please.]

3. Site searching. I can search for sites that host related material rather than searching for pages.

Of course I'd like features such as global coverage (including the "deep web" if possible), filtering by page age, domain, page size, number of advertisement links, up-to-date databases etc., etc.

_______________________________
> Search-l mailing list
> Search-l@wikia.com
> http://lists.wikia.com/mailman/listinfo/search-l
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070102/6980a6f9/attachment.html

From bill at billbeardcostarica.com Tue Jan 2 18:07:21 2007
From: bill at billbeardcostarica.com (Bill)
Date: Tue Jan 2 18:14:10 2007
Subject: [Search-l] List
Message-ID:

Bill Beard's Costa Rica Scuba Diving & Adventure Tours
costarica@diveres.com
Toll free: 877 853-0538
Phone 954-453-5044, Fax 954-351-9740
www.billbeardcostarica.com
Free Informative Newsletter
Come to Costa Rica For The Natural Beauty... Stay For The Adventure!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 493 bytes
Desc: not available
Url : http://lists.wikia.com/pipermail/search-l/attachments/20070102/390f1dd5/attachment.bin

From thomasasta at gmx.net Tue Jan 2 18:45:05 2007
From: thomasasta at gmx.net (thomasasta@gmx.net)
Date: Tue Jan 2 18:45:07 2007
Subject: [Search-l] Requirements?
In-Reply-To:
References: <001801c72e27$c4e28410$b900a8c0@Botley.local>
Message-ID: <20070102184505.202760@gmx.net>

see the comments between the lines...

-------- Original Message --------
Date: Tue, 2 Jan 2007 17:12:30 +0000
From: "Peter Burden"
To: "Keith Botley"
Subject: Re: [Search-l] Requirements?

> On 02/01/07, Keith Botley wrote:
> >
> > Bill Surowiec made a fine post to the mailing list and ended it with this
> > comment.
> >
> > "I believe that determining and articulating the target in enough detail
> > to be actionable is the first order of business."
> >
> > I agree with his prioritization of business as the requirements being
> > articulated clearly for all. I have seen numerous posts to the mailing list
> > concerning architecture which is valuable and informative but possibly not
> > the current focus.
> >

The goal is to let millions of users participate in ranking websites, to give their judgment about content to other content searchers. This could be done in a central one-stop shop, but participation is immediate, as in a wiki, and so is seeing the results.

>
> Or, perhaps, to put it another way. What's it going to do
> that existing engines don't do? Or in business speak
> what's the unique selling proposition?
>

The core competence is the community aspect and the participation of each user adding his vote for a website.

> The underlying technology or the way the software
> development team is organised is not important from
> this point of view.

Well, you could do it in a central way, letting users build a url directory like dmoz.org or letting users vote for websites..
then Wikisari is just a toolbar in the browser, which fetches the rating of the website shown. And then users can search in the ranking built upon the users' votes. But the database of the search engine is not open to everyone; only a few SQL server admins and nutch freaks can install a clone. Cloning a central one-stop shop is an illusion! Wikipedia has not grown because freenet.de or OLPC or any other open media DVD integrated wikipedia articles. It has grown because of the immediate participation of the surfer.

> I can only speak for myself - but here's some of the
> features I'd like from a search engine. No existing
> engine, AFAIK, does all of these.
>
> 1. User tweakable ranking. I.e. I can choose the
> parameters that control the ordering of results to
> meet my particular current whims and fancies.
> [BTW I have some serious doubts about the usefulness
> of page ranking.]
>

Yacy.net has both: you can rank pages with a plus and a minus, and the next version will have the rank buttons in the already existing toolbar. Even more, you have 12-15 categories to adjust the ranking of the pages, e.g. for the date or for word spelling or whatever. It is already in!!! And you can adjust it to your needs if you are looking for an enhanced search experience with yacy.

> 2. Semantic searching. I can search for "pages"
> that are relevant to a topic by describing the topic
> rather than having to think of likely combinations of
> keywords. [But keep the AI research community and
> the ontologists at arm's length please.]

yacy offers a semantic component, and for the snippets of the websites, which are generated "live" rather than from a cache, the best snippet is calculated. See and try it; with the above-mentioned adjustment of the weighting of the search criteria, you can tune the search ranking to best fit your semantic search.

> 3. Site searching. I can search for sites that host
> related material rather searching for pages.
>

Well, you start crawling with one url, which is mostly the domain.
A domain that fits your interest completely is what you would call a portal, e.g. for pregnancy. Well, you will find it very soon with every search, once a few pages are indexed, as you then also surf to the domain, which may be a portal. As yacy often starts indexing in a peer with wikipedia urls or dmoz.org urls, you very soon get page results which belong to a very relevant site and a very relevant domain, or even the portal: pregnancy.org

As many nodes index relevant web directories, it is a better way than google just surfing from link to link. I guess in a p2p engine many users insert their bookmarks to such portals and so give a good starting point. BTW, yacy offers to save your bookmarks and make them public as a "tip" to other nodes, so you can also search in the popular bookmarks of others... It is so feature-rich.. try it out and post ideas on the forum about how to make it better and what would fit your needs for pregnancy.. ;-)

> Of course I'd like features such as global coverage (
> including the "deep web" if possible), filtering by page
> age, domain, page size, number of advertisement links,
> up-to-date databases etc., etc.,

No search engine allows you to customize the search rank criteria; in the mentioned p2p search engine this is all possible. And: no search engine offers to put the newest crawled url on top of the ranking; only yacy allows this... So search for your name and get the latest page for the term "wikisari"; not even google offers this..

Though the database of yacy needs to grow.. this is why I suggest that wikifoundation just have a look at it. If another decision is taken, e.g. for minerva or nutch or an own development, then I will look at how to support this, but I guess it is all a process of communication and participation, so let's talk here on the list until February, before an official statement is posted..
or a project plan with milestones is discussed and then personnel and project team members with the fitting coding experience are found, but I guess they need java anyway... :-))

From aerik at thesylvans.com Wed Jan 3 19:50:49 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Wed Jan 3 19:50:52 2007
Subject: [Search-l] Interesting presentation on Lucene, Nutch
Message-ID: <355a36af0701031150r5119fb6ej8a3a931f6be33eeb@mail.gmail.com>

http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070103/7c4ec90c/attachment.html

From pcleddy at gmail.com Wed Jan 3 20:47:25 2007
From: pcleddy at gmail.com (Paul Charles Leddy)
Date: Wed Jan 3 20:47:29 2007
Subject: [Search-l] Care
Message-ID:

Something to learn from wikipedia and apply to this human-driven search "engine":

Articles in wikipedia are only as good as the group of people that currently CARE about them. As the community of those who give-a-care changes, the article can swing this way or that. There is a possibility the Talk pages show this history together with the history.

From jwales at wikia.com Wed Jan 3 21:02:59 2007
From: jwales at wikia.com (Jimmy Wales)
Date: Wed Jan 3 22:06:24 2007
Subject: [Search-l] First steps to getting it right...
Message-ID: <459C1A03.9030708@wikia.com>

One of the things that I believe in passionately is genuine human communities, as opposed to "crowdsourcing". What do I mean by that? I mean, people who get to know each other, over time, as real human beings, and through that process, gain a sense of trust and responsibility for each other and for the task at hand.

So for me, if we are to succeed here, this is the first place we need to focus attention...
This project is different from a traditional wiki, in which people are coming to know each other through the process of writing. On Wikipedia, on Uncyclopedia, on world.wikia.com, on thousands of other wikis around the web, people write, rewrite, debate on the talk page, and come to some mutual understanding and trust. The process is messy, but that's because the process is human, and genuinely human processes are going to be messy and complex.

I think we need to design for 3 levels of process here, because of the differences that I see between what we are trying to do and what a normal wiki is trying to do.

First, there is the spider and the ranking algorithms. These need to be public and transparent. They need to be controllable by the community. But the very nature of this machine-centric process means that a lot of the heavy lifting here will be on the level of *code*, i.e., this is a community of developers.

Second, there is the possibility and necessity of massive feedback from the general public. This can be in the form of voting, thumbs up / thumbs down on links, digg-style pages, etc. My own view is that these things should be simple, intuitive, and the means by which we encourage people to get more involved. HOWEVER, because these are precisely the kinds of mechanisms that can be "gamed" by spammers and other ne'er-do-wells, they have to be treated with a significant amount of caution and be public, transparent, and controllable by the community.

And finally, the bit that I think makes it all work: the space for the core community of users, users who are not necessarily programmers, but who can make serious editorial judgments in a neutral way through the process of open discussion and debate.

=====

I can explain each of these levels and what I see as the reason for them, drawing on my experiences with search in the past, my experiences with wikis, my observations and experiences with dmoz, etc. But that can come out over time.
For now I just want to point out that the largest amount of skepticism about what we are going to try to accomplish here is driven by the inherent issue of spammers. There are huge incentives for people to try to abuse our good will and we have to anticipate and expect that. But, unlike many of the skeptics who think that this is impossible, I am very confident that if we can build a genuine community and give ourselves as a community the tools we need, then we can deal with this issue without a lot of trouble.

Tomorrow I will write more about how I see the core design working.

--Jimbo

From xixtas at gmail.com Wed Jan 3 22:36:59 2007
From: xixtas at gmail.com (Randy Wilson)
Date: Wed Jan 3 22:37:01 2007
Subject: [Search-l] Relationship of Community and Search Engine
Message-ID: <10f6add00701031436j2bb4a717oa02c6c0e977fa0a@mail.gmail.com>

Hello. I am an IT manager and entrepreneur in real life and an editor at the Open Directory Project and also at Wikibooks and Wikipedia.

I don't understand the vision. I guess one fundamental thing that would help me understand is to describe how the community will interact with the engine. Is this a project just for developers, or is there a role for content reviewers and classifiers as well? Will content reviewers mainly be focused on blacklisting and whitelisting?

Disambiguation pages to discern user intent is not a new concept in search engines. Will this project include a classification component? The statement that non-spam is "more finite" seems troublesome to me, because though it is undoubtedly true (accepting for the moment that there are levels of finiteness), it implies that much of what is out there to be classified can be ignored. I think that there is substantially more unique and valuable information than spam on the web, and that valuable information is not limited to a few wiki*.org domains.

The hardest thing about eating an elephant may be figuring out where to start.
Right now it seems like everyone is standing around looking hungry, and no-one even has a fork, much less a knife, or barbeque grill.

From jwales at wikia.com Wed Jan 3 22:55:23 2007
From: jwales at wikia.com (Jimmy Wales)
Date: Wed Jan 3 22:55:49 2007
Subject: [Search-l] Relationship of Community and Search Engine
In-Reply-To: <10f6add00701031436j2bb4a717oa02c6c0e977fa0a@mail.gmail.com>
References: <10f6add00701031436j2bb4a717oa02c6c0e977fa0a@mail.gmail.com>
Message-ID: <459C345B.8010605@wikia.com>

Randy Wilson wrote:
> I don't understand the vision. I guess one fundamental thing that would help me understand is to describe how the community will interact with the engine. Is this a project just for developers, or is there a role for content reviewers and classifiers as well? Will content reviewers mainly be focused on blacklisting and whitelisting?

My design includes room for content reviewers and classifiers. Some of that will be focused on blacklisting and whitelisting, but I think the community can and should do a lot more than that. Open dialog and discussion is incredibly powerful for getting thoughtful results in any process... and this will be encouraged.

> Disambiguation pages to discern user intent is not a new concept in search engines. Will this project include a classification component? The statement that non-spam is "more finite" seems troublesome to me, because though it is undoubtedly true (accepting for the moment that there are levels of finiteness), it implies that much of what is out there to be classified can be ignored. I think that there is substantially more unique and valuable information than spam on the web, and that valuable information is not limited to a few wiki*.org domains.
Yes, there is a great deal of valuable information out there, and it is not limited to a few domains.

> The hardest thing about eating an elephant may be figuring out where to start. Right now it seems like everyone is standing around looking hungry, and no-one even has a fork, much less a knife, or barbeque grill.

That's ok. The wiki world is messy. We eat with our hands. ;-)

--Jimbo

From thomasasta at gmx.net Thu Jan 4 00:04:10 2007
From: thomasasta at gmx.net (thomasasta@gmx.net)
Date: Thu Jan 4 00:04:13 2007
Subject: [Search-l] Relationship of Community and Search Engine
In-Reply-To: <459C345B.8010605@wikia.com>
References: <10f6add00701031436j2bb4a717oa02c6c0e977fa0a@mail.gmail.com> <459C345B.8010605@wikia.com>
Message-ID: <20070104000410.118560@gmx.net>

Hi,

good question and ideas. Users should be able to insert pages into the search index, rate URLs/pages, and add comments to each page. This is a simple concept:

- one peer/user, one vote per URL/page
- scale: ++ / + / - / --

Furthermore, a comment on that page URL could be written. We could store all of it in a database.

yacy.net, if I may come back to this advanced tool, already has the rating of URLs and also the concept of storing such comments in the DHT database. So anyone can not only submit URLs for indexing; each peer can index them itself with its local node, and each surfer/node user can rank each viewed page in a toolbar. Furthermore, yacy allows creating bookmarks, which I can make public, so an index of suggested favourite sites is created in the p2p network. This function works without a central machine for looking up the published bookmarks of the nearest peers, so it is quite a good concept.

And you ask what can be done? If you are a coder, help code yacy in Java. If you know XUL or C++, then let's focus on the toolbar for yacy; an Internet Explorer toolbar is needed, because the existing one is for Mozilla.
If you cannot code, then just run a yacy peer and try the search experience. Play around with the system and learn all the features you can use. Everyone who is not installing a test machine (of whichever search engine) should unsubscribe from the list, or send feedback for improving the mentioned applications, nutch or yacy. So: install something, code something, and give us feedback.

Furthermore, we are all interested in the paper Jimmy announced, which should lead to a decision between nutch and yacy. But I guess it is sensible to still have a period of discussion, to get ideas and votes from the community; the open question of a joint venture between the coding teams and the Wikimedia Foundation also has to be organized. So I hope we get an answer after some of the next meetings of the Wikimedia Foundation. But really... Jim is just waiting for the servers to be sent... ;-))

Kind regards

From aerik at thesylvans.com Thu Jan 4 00:22:57 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Thu Jan 4 00:23:00 2007
Subject: [Search-l] Relationship of Community and Search Engine
In-Reply-To: <10f6add00701031436j2bb4a717oa02c6c0e977fa0a@mail.gmail.com>
References: <10f6add00701031436j2bb4a717oa02c6c0e977fa0a@mail.gmail.com>
Message-ID: <355a36af0701031622o244994c7w60fa71d3ba8cb49f@mail.gmail.com>

On 1/3/07, Randy Wilson wrote:
> I don't understand the vision. I guess one fundamental thing that would help me understand is to describe how the community will interact with the engine. Is this a project just for developers, or is there a role for content reviewers and classifiers as well? (etc...)
>
> The hardest thing about eating an elephant may be figuring out where to start.
> Right now it seems like everyone is standing around looking hungry, and no-one even has a fork, much less a knife, or barbeque grill.

Randy, I think that's kind of the point: This is a grand vision, and there are no details (or very few?) that are agreed upon yet. I think the only certainty is a vision that a search engine can have more / different community involvement than what is generally out there now, and this is a desired outcome. It's a big vision. Instead of asking Jimmy, or everyone else, what we're doing, I think at this stage it is appropriate to say how you think the vision should look.

So, yeah, everyone is standing around hungry. But we don't know that it's an elephant; I think we're still figuring out what we're hunting for.

What do you want to hunt for?

Aerik

From peter.burden at gmail.com Thu Jan 4 00:45:57 2007
From: peter.burden at gmail.com (peter burden)
Date: Thu Jan 4 00:46:05 2007
Subject: [Search-l] First steps to getting it right...
In-Reply-To: <459C1A03.9030708@wikia.com>
References: <459C1A03.9030708@wikia.com>
Message-ID: <459C4E45.6050907@gmail.com>

Jimmy Wales wrote:
> One of the things that I believe in passionately is genuine human communities, as opposed to "crowdsourcing". What do I mean by that?
>
> I mean, people who get to know each other, over time, as real human beings, and through that process, gain a sense of trust and responsibility for each other and for the task at hand. So for me, if we are to succeed here, this is the first place we need to focus attention...
>
> This project is different from a traditional wiki, in which people are coming to know each other through the process of writing.
> On Wikipedia, on Uncyclopedia, on world.wikia.com, on thousands of other wikis around the web, people write, rewrite, debate on the talk page, and come to some mutual understanding and trust. The process is messy, but that's because the process is human, and genuinely human processes are going to be messy and complex.
>
> I think we need to design for 3 levels of process here, because of the differences that I see between what we are trying to do and what a normal wiki is trying to do.
>
> First, there is the spider and the ranking algorithms. These need to be public and transparent. They need to be controllable by the community. But the very nature of this machine-centric process means that a lot of the heavy lifting here will be on the level of *code*, i.e., this is a community of developers.

Yes. These are important basic issues. Decisions here will also fundamentally affect performance. See http://www.baselinemag.com/article2/0,1540,1985046,00.asp for how a major engine approaches the engineering issues. Although initial traffic is unlikely to be at the level of 200M+ queries a day, the design should be scalable. Writing a good spider (especially one that handles the increasing number of CMS-based dynamic sites properly) is a non-trivial task.

> Second, there is the possibility and necessity of massive feedback from the general public. This can be in the form of voting, thumbs up / thumbs down on links, digg-style pages, etc. My own view is that these things should be simple, intuitive, and the means by which we encourage people to get more involved. HOWEVER, because these are precisely the kinds of mechanisms that can be "gamed" by spammers and other ne'er-do-wells, they have to be treated with a significant amount of caution and be public, transparent, and controllable by the community.

This could be the "killer" but first we need to learn the lessons of DMOZ. An SE that has semantic knowledge could work very well.
Community semantic tagging (using a defined "ontological" scheme) would be an excellent thing, and it is difficult to see how it could be effectively spammed. Allowing search users to specify their own ranking weightings (and possibly algorithms) at search time would provide a further opportunity for the community to experiment and feed back their experiences.

I do not think community voting on links etc. is a good idea. I can't see it scaling. Community input should be directed towards tuning and developing ranking algorithms and spam detection mechanisms.

My main concern relates to the scale of the whole enterprise. Are we intending to outdo you know who (as well as Yahoo and MSN)? If so, a long careful look at the numbers is called for.

> And finally, the bit that I think makes it all work: the space for the core community of users, users who are not necessarily programmers, but who can make serious editorial judgments in a neutral way through the process of open discussion and debate.
>
> =====
>
> I can explain each of these levels and what I see as the reason for them, drawing on my experiences with search in the past, my experiences with wikis, my observations and experiences with dmoz, etc. But that can come out over time.
>
> For now I just want to point out that the largest amount of skepticism about what we are going to try to accomplish here is driven by the inherent issue of spammers. There are huge incentives for people to try to abuse our good will and we have to anticipate and expect that. But, unlike many of the skeptics who think that this is impossible, I am very confident that if we can build a genuine community and give ourselves as a community the tools we need, then we can deal with this issue without a lot of trouble.
>
> Tomorrow I will write more about how I see the core design working.
>
> --Jimbo

From aerik at thesylvans.com Thu Jan 4 00:49:26 2007
From: aerik at thesylvans.com (Aerik Sylvan)
Date: Thu Jan 4 00:49:29 2007
Subject: [Search-l] First steps to getting it right...
In-Reply-To: <459C1A03.9030708@wikia.com>
References: <459C1A03.9030708@wikia.com>
Message-ID: <355a36af0701031649t7b226673oa338472be3456e7@mail.gmail.com>

Hi Jimmy!

Nice outline. I was very happy to see this project revived. Waaay back when the original version of Wikia was fading we'd batted around some ideas (you and I and Angela exchanged a few emails), and it sounds like those are still valid. I think it's worth looking at that project, and discussing the things that did and did not work, in addition to discussing current search engine approaches, etc.

I think you hit a lot of key points with your note, but I think the gentleman (sorry, I forget your name) who asked about the value proposition asked a key one as well. Since people are on this list, or looking at the wiki, they must be bought into the community idea. But I think discussing the value proposition is a key ingredient to forming this thing.

Very briefly, my concept of the value proposition is very simply this: search engine algorithms are only so good, and will only be so good, until we have real AI that can actually understand context ("This page is about this and such"). Right now computers can sort of approximate that with semantic algorithms, but only human beings can really do the job. (And incidentally, this is why blackhat SEO guys can game the engines.) So, I think the value proposition is the promise of more relevant results and less garbage disguised as content. Along the way we might add some cool bells and whistles.
On 1/3/07, Jimmy Wales wrote:
> One of the things that I believe in passionately is genuine human communities, as opposed to "crowdsourcing". What do I mean by that?
>
> (etc...)
>
> I think we need to design for 3 levels of process here, because of the differences that I see between what we are trying to do and what a normal wiki is trying to do.

I think what you're saying here is really key: 3 levels of processes. A very distinct difference from Wikipedia. Evaluating/categorizing/tagging/whatever existing content as opposed to writing content.

> Second, there is the possibility and necessity of massive feedback from the general public. This can be in the form of voting, thumbs up / thumbs down on links, digg-style pages, etc. My own view is that these things should be simple, intuitive, and the means by which we encourage people to get more involved.

I think that this is going to be the point of most contention: how to involve the "general public". I also think this is where the magic happens.

> And finally, the bit that I think makes it all work: the space for the core community of users, users who are not necessarily programmers, but who can make serious editorial judgments in a neutral way through the process of open discussion and debate.

I may be misunderstanding you, but I really think that in order to have a chance of doing something really different in a search engine (sifting through 10's of millions, 100's of millions, billions? of documents) we need to think of the community as being the masses of the general public. The "core users" cannot possibly hope to digest that much information in a way that generates enough data to really have an impact on search results. I think anything short of enabling a gazillion users to very (VERY) easily participate in the improvement of search relevance will fall short of being a truly community-driven search engine.
Not that the "core users" are not important - they're critical - but they will be there and will do critical work almost as a given.

> For now I just want to point out that the largest amount of skepticism about what we are going to try to accomplish here is driven by the inherent issue of spammers.

Yup. But I think the solution to that problem - the spammers - has to be part of the value proposition. Better, more relevant results must be our holy grail, otherwise what's our crusade for?

Aerik

From jwales at wikia.com Thu Jan 4 00:51:34 2007
From: jwales at wikia.com (Jimmy Wales)
Date: Thu Jan 4 00:51:59 2007
Subject: [Search-l] First steps to getting it right...
In-Reply-To: <459C4E45.6050907@gmail.com>
References: <459C1A03.9030708@wikia.com> <459C4E45.6050907@gmail.com>
Message-ID: <459C4F96.6010104@wikia.com>

peter burden wrote:
> By allowing search users to specify their own ranking weightings (and possibly algorithms) at search time would provide a further opportunity for the community to experiment and feed back their experiences.

Yes, although realistically this should be available upon request, rather than being the default of course. Most of the time when I search, I just want to search, I don't want to fiddle with algorithms. But absolutely, having the possibility for people who are into experimentation to be able to do so is great. :)

> I do not think community voting on links etc., is a good idea. I can't see it scaling. Community input should be directed towards tuning and developing ranking algorithms and spam detection mechanisms.

Just to be clear, I generally agree with this. I do think having a way for ordinary users, who are not prepared to get involved heavily, to add their knowledge to the mix is crucial.
In my model, these "thumbs up / thumbs down" things play a secondary and advisory role to the community.

> My main concern relates to the scale of the whole enterprise. Are we intending to outdo you know who (as well as Yahoo and MSN)? If so a long careful look at the numbers is called for.

Eh, I dunno. What I would anticipate here is that with all open source algorithms and software, as well as open indexing (i.e. making all the data available freely as well), we will see a ton of people borrowing our technology for all kinds of purposes. That's what a healthy free culture ecosystem should look like, I think. :)

--Jimbo

From jwales at wikia.com Thu Jan 4 00:57:03 2007
From: jwales at wikia.com (Jimmy Wales)
Date: Thu Jan 4 00:57:29 2007
Subject: [Search-l] First steps to getting it right...
In-Reply-To: <355a36af0701031649t7b226673oa338472be3456e7@mail.gmail.com>
References: <459C1A03.9030708@wikia.com> <355a36af0701031649t7b226673oa338472be3456e7@mail.gmail.com>
Message-ID: <459C50DF.5020700@wikia.com>

Aerik Sylvan wrote:
> I think you hit a lot of key points with your note, but I think the gentleman (sorry, I forget your name) who asked about the value proposition asked a key one as well. Since people are on this list, or looking at the wiki, they must be bought into the community idea. But I think discussing the value proposition - is a key ingredient to forming

(Aerik wrote a nice summary of search quality as a value proposition.)

To add to this, let me throw out what emotionally rings more strongly with me as a value proposition here: search is a fundamental part of the infrastructure of the net, and as with all the other fundamental parts of the infrastructure of the net, it needs to be *transparent*, it needs to be *open*, we need to be able to look under the hood (as a society) and see what is what, and why things are ranked as they are.
Here's something excellent from Blake Ross of Firefox fame: http://www.blakeross.com/2006/12/25/google-tips/

What's great about Firefox? Well, first of all, it is just a *better browser*... that's like Aerik's value proposition of search quality. But at a deeper level what excites me about Firefox is that it is *free software*. And as such, it is the foundation for all kinds of interesting experimentation that will push the net forward... (Flock browser for example).

Free knowledge, a free world, requires an open, transparent, and free (in the sense of GNU) infrastructure. I want us to help build that.

--Jimbo

p.s. Nutch and Lucene are critical building blocks, and so perhaps are some of the other projects that have been mentioned here.

From charmtg at yahoo.com Thu Jan 4 07:09:03 2007
From: charmtg at yahoo.com (Charlene Wright)
Date: Thu Jan 4 07:15:46 2007
Subject: [Search-l] First steps to getting it right...
Message-ID: <20070104070903.50630.qmail@web37206.mail.mud.yahoo.com>

Hi All,

I'd just like to throw out some random thoughts. I don't necessarily have time to organize or "proofread" this, so I'll just try to get the main things in, and not worry too much about presentation right now.

I did just a little bit of research today at a local bookstore which allows reading on-site without having to purchase: a chapter in one book, and some parts, mostly at the end, of another book.

(1) Information Architecture for the World Wide Web
    O'Reilly, ISBN 1-56592-282-4
    Louis Rosenfeld & Peter Morville
    (c) Feb 1998 1st Edition
    Chapter 6

(2) Search Engine Optimization for Dummies
    Wiley, ISBN 0-7645-7658-6
    Peter Kent
    (c) 2004 (possibly 3rd printing?)

There was a screenshot at (1 page 107) of a search engine screen similar to the kind of search engine I have always dreamed of having; if I had a search engine interface like that I'd be the happiest searcher in the world. It had a query box at the top of the page, then under that a numbered list of previous searches.
The list looked something like this:

(3) 2 and 1
(2) keyword (diabetes)
(1) subject (medical diagnosis)

The query box at the top contained something like "3 and date(>2000)". Now, not everyone is going to want or need functionality at this sort of level, but I'm very comfortable with boolean operators, and if I had a full-web search engine interface like this, I'd be in hog heaven. I've never used a search engine with *this* much flexibility -- Yes, I've used boolean operators, and very successfully, but I've never seen an interface where I could combine previous entire queries with boolean operators. [At this point someone will probably give me links to at least 12 search engines that allow this.]

A couple of thoughts I'd like to add to this concept. First, and most importantly, this brings up a very good point: diversity, and flexibility. We need to have a diverse set of interfaces; some users will want a very simple text box with a Go button, some users will want a form sort of interface with some common fields shown, and some users will want a Full-Throttle interface like the one above. It would be great to be able to offer this sort of flexibility. In the City of Perfectville, we could even offer some sort of API which would allow Anyone to develop their own interface(s), and we could integrate some of the more popular interfaces that people come up with into the main site. In this way, we can leverage not only the power of the masses to create better results, but also to create better interfaces.

While I was considering what I'm calling the Full Throttle interface described above, I thought of a possible expansion on the idea, making it even MORE flexible (if you can imagine that).
A la template parameter syntax in MediaWiki, I'm thinking of being able to add options to search fields in the query, something like date(>2000|ascending) for search results with dates on or after 2000-01-01, sorted in ascending (chronological) order, vs date(>2000|descending) with reverse chronological order (newest first). Another usage could be author(|asc) where there's no filter criteria for author, but the results will be sorted in ascending order by author. A search could be made case-sensitive as regards a particular field (but not the other search fields) using keyword(whatever|cs).

At (1 page 117) is an example interface that would allow the user to tweak the relevance algorithm. In that interface the user picks an Importance (Low, Medium, or High) for each of several aspects: finding all the search terms, the number of times a search term appears on the page, how early in the text a search term appears, whether the term appears in the title, the proximity of the search terms to each other, and whether the terms appear in the order given. This would give Peter what he was describing, and takes the openness that Jimbo wants one step further: not only can the user tell how the relevance was determined, but they can also change it if they have a reason to.

(1 page 121) points out the need for what I'm calling a Query Help Desk. Sure, we're going to have lots of human input into the database, but what if all that still doesn't get the user what they are really looking for? Then we can take the human contribution to its ultimate conclusion: give the user a way to submit a description of what they are looking for, and how it differs from what they got, to a real (group of) human(s) who can assist in building the right query. Yeah, this has a potential to generate a HUGE FLOOD, but hey, no one thought we could keep up with vandalism on Wikipedia either, but we do a pretty good job in general (at least I think so).

(1 pages 125 to 129) present "Search Zones".
What I took away from this is that it would be useful, in addition to being able to filter for keywords and author/date/title and so on, to have categorization and be able to narrow the results by category.

Some resources mentioned in (2) that I found interesting: Google Zeitgeist, searchenginewatch.com, currentwisdom.com, SearchEngineBulletin.com.

That's all for now (and I'm sure plenty, as I'm lucky if even 2 people are still reading this far).

Thank you,
Charm

__________________________________________________
Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com

From jeremie at jabber.org Thu Jan 4 08:12:58 2007
From: jeremie at jabber.org (jer)
Date: Thu Jan 4 08:18:16 2007
Subject: [Search-l] Information wants to be _found_
Message-ID:

The timing of all this is a bit eerie; it was exactly eight years ago today that I announced Jabber (on /.) and now here I am again even more excited than I was then. A few weeks ago I finally began sending around to friends a few early drafts of Atlas (the project's code name), and the overlap in vision with what Jimbo has spoken of is astounding and motivating. I don't think it's coincidence, it's just the right time for this.

I don't believe there is another single magic search algorithm or new breakthrough technique that will define the next generation of search; instead it's all of them working together and each doing their own part best in an open and transparent way.

I'm going to take a strong starting position in my contribution to the overall ideals: I want to personally help establish a simple and powerful foundation of protocols. A core syntax and set of languages that everyone can speak, and much like a wiki, designed to evolve, adapt, and grow openly.

The Internet itself is a very simple set of rules and basic protocols that allow independent networks to connect and appear as a single system.
The ideas driving me to originally design Atlas are based on the same foundational guidelines applied to search: * Enabling independent search technologies to unify as a single system by speaking the same basic protocols * Every related project, yacy, nutch, lucene, and even established or niche search players, can begin to support some simple open protocols and connect with each other in a larger ecosystem. * Groups can organize around function, content, media types, indexing algorithms, ranking, human input and feedback, or even real world relationships (local user groups). * Everyone supports unified interfaces to collaboratively (or competitively) present a new kind of search engine. I'm diligently working on my drafts for Atlas and intend to present them here soon, the core concepts being: organization into many independent Networks, common protocols connecting them, strong reputation systems for checks and balances, and operating openly in a free market. I've been working on many parts of this for years and I want to ensure I've boiled everything down to its absolute simplest form to start - after it's public it's easy to make things more complicated and almost impossible to further simplify them. Thanks Jimbo for galvanizing the effort , looking forward to the future that this group is going to help influence, Jer From prometheus at freesurf.ch Thu Jan 4 08:08:00 2007 From: prometheus at freesurf.ch (Prometheus) Date: Thu Jan 4 08:28:29 2007 Subject: [Search-l] Discussion on including users References: <20070104071549.3D650B98639@shannon.tpa.wikia-inc.com> Message-ID: <010201c72fd7$75182820$0301a8c0@your1ca50768ba> As I have not contributed to this list yet, just a few words about myself. I am professionally and privately using search engines heavily and am absolutely unsatisfied with this world as it is right now and would highly welcome a Wikia project. 
Further, I have about 25 years of IT experience, particularly related to handling data smartly and swiftly. Hence, I might be able to contribute. Having read the Jan-4 threads, I think most of the key problems are outlined (mass participation vs. small circle, spamming risk, technology). While I think the technology bits will be covered rather easily over time, the key question of "How do we make sure that we get relevant results quickly without falling prey to ill-doers?" needs resolution on a more abstract level. Assumption #1: Wikia needs mass participation ------------------------------------------------ Like others, I cannot imagine that a Wikia project would thrive unless everybody who wants to can vote on the relevance of results in the easiest way possible. No other way will enable Wikia to cope with the amount of available information. In order to enable this (possibly even without any kind of subscription at all), we need to accept the risk of heavy distortion of results by wrongdoers (with some correction measures). I see two ways of this happening: a) technology-driven 'spam' (i.e. bots that vote incorrectly over and over again), and b) manually driven manipulation (i.e. groups of people who drive the relevance of sites up or down in their own interest). Assumption #2: Technology driven spam can be handled --------------------------------------------------------- I personally think that bots will definitely be an issue, but can be handled with the right technology effort (from recognizing multiple entries from one or a few IP addresses down to other patterns). With the right approach, Wikia should be able to reduce the sheer amount of it to a level comparable to the second problem (manual manipulation), i.e. only very careful bots which don't create too many entries at a time will go undetected.
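A minimal sketch of the "multiple entries from one IP address" check mentioned under Assumption #2. Everything here is an assumption for illustration: the function name flag_suspect_ips, the (ip, timestamp) vote format, and the window and threshold values are hypothetical, and a real filter would need the "other patterns" as well.

```python
from collections import defaultdict


def flag_suspect_ips(votes, window_seconds=60, max_votes=5):
    """Return the set of IPs that cast more than max_votes within any
    sliding window of window_seconds.

    votes is an iterable of (ip, unix_timestamp) pairs. The format and the
    default limits are illustrative assumptions, not a tested policy.
    """
    by_ip = defaultdict(list)
    for ip, ts in votes:
        by_ip[ip].append(ts)

    suspects = set()
    for ip, stamps in by_ip.items():
        stamps.sort()
        start = 0  # index of the oldest vote still inside the window
        for end, ts in enumerate(stamps):
            while ts - stamps[start] > window_seconds:
                start += 1
            if end - start + 1 > max_votes:
                suspects.add(ip)
                break
    return suspects


if __name__ == "__main__":
    burst = [("10.0.0.1", t) for t in range(10)]        # 10 votes in 10 seconds
    slow = [("10.0.0.3", t * 100) for t in range(20)]   # one vote every 100 seconds
    print(flag_suspect_ips(burst + slow))
```

Note that the slow voter is not flagged, which is exactly the "very careful bot" case above: it has to be caught by the corrective ratings rather than by rate limits.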
Assumption #3: Eliminate wrongdoing by allowing a core group to provide 'corrective' ratings ------------------------------------------------------------------------------------------- As reducing the voters to a small group will never yield the number of ratings required to make the project successful, 'after the fact' elimination of wrongdoing seems to be the only way to set things straight. I would assume that we can build a significantly large group of 'trusted users' (see below on how I would go about that), who could be allowed to 'downgrade' or 'upgrade' sites, put them on watch lists, or even be asked by Wikia to review sites. Those countervotes should be weighted so that they offset a significant number of fraudulent votes, with only a few "Wikia Community Cleaners" doing their job. How to become a 'Community Cleaner' ---------------------------------------- To me, this seems a relatively easy approach, comparable to many concepts already in use on many sites. In the Wikia case, I would see the following approach: - Subscribed users who accept this role - Who have provided a significant number of site ratings - Whose site ratings have been confirmed by a significant number of other users These are my first thoughts on the subject. I am happy to discuss and elaborate further, should anybody see a need. Good luck with this important project! Jay From ger.dupont at gmail.com Thu Jan 4 09:04:04 2007 From: ger.dupont at gmail.com (Gérard Dupont) Date: Thu Jan 4 09:04:06 2007 Subject: [Search-l] [introduction] [target] Some reflexions Message-ID: <471965e10701040104q34af1319od9e4ee3f8a8c107@mail.gmail.com> Hi everybody, First, let me introduce myself: I'm a PhD student and I work in an R&D department involved in the information search field.
I just discovered the project 3 days ago, but as this mailing list appears to be open, I wanted to post some of my ideas (or just write down things that some are already thinking about, to begin a debate). So, how do I think of a "*wiki-inspired search engine*"? The community should be involved in many ways, and not only in rating sites. As in Wikipedia, we should be able to comment on sites and discuss them. But the approach should be different from the Open Directory Project, which tries to go from general directories down to specific sites. For example, we can think of a wiki specific to websites, where the categorization is built upon the descriptions and discussions from the community. It is linked to a well-known part of search engine design: Web topology. This appears to be fruitful, but really complicated to do in a fully automated way, so why not ask the users to help the machine? Of course, that could be quite complicated for the "simple" user. So tools like a browser toolbar would be useful to provide simple rating tools and to link to richer possibilities to comment and discuss. That's for the feedback part (second part of the recent Jimbo explanation) ... which should be one of the last of course... Then the more technical part: how will the engine benefit from the community? As far as I know, feedback from users is one of the most valuable strategies to better learn and improve the precision of a search engine. Research in that field has its roots in the 70's. Nothing functional (or nothing big enough (I don't want to minimize the work done)) came out of those studies, but it can be done! The ranking algorithm can of course include a user rating part (like Google's page ranking) in order to improve the level of rated Web sites in many ways (that point should be fully studied to work well, but it will).
The spiders could also benefit from user experience to reach the invisible part of the Web (I don't know the word for that in English; it means the part of the Web that is not clearly linked from the most well-known sites, and so is ignored when searching through a search engine). The website wiki should guide the spiders (what about a wiki-spider?). Of course, as has been said, the community should be able to control those machine processes... That point is partly covered by the previous ideas, and some work has to be done by the developers in order to permit more flexibility and a user-customized search engine. It means transparency, and introduces the problem of spammers... But I'm sure some solution can be found to resolve such difficulties if we take care of it from the beginning. To conclude, my point of view on the architecture of the core engine and the Lucene/Nutch/Hadoop vs. Yacy debate. As I have used Lucene as a developer, I know that it is a good and reliable project. For Nutch and Hadoop, I only know their aims, but I would guess that they are as reliable... But the p2p aspect of Yacy can have lots of advantages. The most important of them is that it reduces the hardware investment. As I see it, this project wants to be a competitor to Google, so it will need a lot of hardware capacity (and much more than that...). Web search should be possible without installing anything, to avoid frightening users, and so big servers should bootstrap the index. But for the future, p2p must be included in the project. I heard about building a link between Hadoop and Yacy... that could be very cool! So what a long message... And no real conclusion! Or something coming from one of my teachers: "to eat a very big elephant, we only need to find a good knife and cut it into some very small pieces..." G.Dupont -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070104/527fcd5a/attachment.html From evanndro.reis at arvos.com.br Thu Jan 4 11:41:00 2007 From: evanndro.reis at arvos.com.br (Evanndro Paes dos Reis) Date: Thu Jan 4 11:48:24 2007 Subject: [Search-l] First steps to getting it right... In-Reply-To: <20070104070903.50630.qmail@web37206.mail.mud.yahoo.com> References: <20070104070903.50630.qmail@web37206.mail.mud.yahoo.com> Message-ID: <06CA2BC5-3DCF-4872-8981-2778E8BFC7C0@arvos.com.br> Well, Charlene, I am one of the many (not two) who went on reading your posting until the end. And I find it very interesting (along with many comments in this forum). First of all, let me introduce myself. I am an entrepreneur in the software industry in Brazil, a graduate of Duke University, working exclusively with open source building blocks, after spending my previous life working for companies such as IBM, Apple and Sun. I've found this idea of having a more "humanized" search engine (my first mistake here) compelling, simple, necessary, exciting, and therefore extremely powerful. Congratulations on pulling this off. And thanks for allowing people like me to participate. But I think I am still not getting it right (to paraphrase Charlene). IMHO, "search engine" is a very limiting term for this project. I think we are opening the doors to something bigger. But enough with rhetoric. Here are some reflections I'd like to share with you all. How we are going to apply these is really up to all of you. Discard the bad ones. Save the good ones (if any). 1. Search engines (second mistake) are as good as how fast they return answers to our queries. 2. Sometimes we really don't know what we are looking for. Some guidance is appreciated. 3. Computers have less bias when answering queries. Human beings have hidden agendas and are extremely influenced by that when answering queries. 4. Search is a small part of the game. Indexing and answering are the key ones. 5.
APIs are the real deal in terms of web applications today. Let developers use the created infrastructure to improve/increase their own solutions. 6. Knowing what your peers are looking for is a good hint for future queries. 7. An open information architecture for indexing, rating, exchanging and formatting query results would be nice. 8. Culture has a huge influence on how searches/queries are made. Only a small percentage of the world speaks English. 9. Context changes the meaning of queries (when I am programming using my Java IDE the term "java" has one meaning at that particular time. When I am using my browser to purchase new drinks it has another). A. IMHO some key components of a successful "query-answering" environment: A.1. Know it all (it is all about indexing/rating/relating/ranking) A.2. Light speed A.3. Different formats (not only links) A.4. Programmable A.5. Predictable A.6. Guider (here I see a huge need for humans) A.7. Active (not only waiting for my queries to give me answers. Am I too AI, here?) Well, there is more, but that's enough for my first post. Carpe Diem, Evanndro Paes dos Reis evanndro@evanndro.com http://www.evanndro.com http://www.evanndro.com/openvox2 On Jan 4, 2007, at 5:09 AM, Charlene Wright wrote: > > Hi All, > > I'd just like to throw out some random thoughts, I don't > necessarily have time to organize or "proofread" this, I'll just > try to get the main things in, and not worry too much about > presentation right now. > > I did just a little bit of research today at a local bookstore > which allows reading on-site without having to purchase, a chapter > in one book and some parts mostly at the end of another book. > > (1) Information Architecture for the World Wide Web > O'Reilly, ISBN 1-56592-282-4 > Louis Rosenfeld & Peter Morville > (c) Feb 1998 1st Edition > Chapter 6 > (2) Search Engine Optimization for Dummies > Wiley, ISBN 0-7645-7658-6 > Peter Kent > (c) 2004 (possibly 3rd printing?)
> > There was a screenshot at (1 page 107) of a search engine screen > similar to the kind of search engine I have always dreamed of > having, if I had a search engine interface like that I'd be the > happiest searcher in the world. It had a query box at the top of > the page, then under that a numbered list of previous searches. > The list looked something like this: > > (3) 2 and 1 > (2) keyword (diabetes) > (1) subject (medical diagnosis) > > The query box at the top contained something like "3 and date > (>2000)". Now, not everyone is going to want or need functionality > at this sort of level, but I'm very comfortable with boolean > operators, and if I had a full-web search engine interface like > this, I'd be in hog heaven. I've never used a search engine with > *this* much flexibility -- Yes, I've used boolean operators, and > very successfully, but I've never seen an interface where I could > combine previous entire queries with boolean operators. [At this > point someone will probably give me links to at least 12 search > engines that allow this.] > > A couple of thoughts I'd like to add to this concept. First, and > most importantly, this brings up a very good point: diversity, and > flexibility. We need to have a diverse set of interfaces; some > users will want a very simple text box with a Go button, some users > will want a form sort of interface with some common fields shown, > and some users will want a Full-Throttle interface like the one > above. It would be great to be able to offer this sort of > flexibility. In the City of Perfectville, we could even offer some > sort of API which would allow Anyone to develop their own interface > (s), and we could integrate some of the more popular interfaces > that people come up with into the main site. In this way, we can > leverage not only the power of the masses to create better results, > but also to create better interfaces. 
> > While I was considering what I'm calling the Full Throttle > interface described above, I thought of a possible expansion on the > idea, making it even MORE flexible (if you can imagine that). Ala > template parameter syntax in MediaWiki, I'm thinking of being able > to add options to search fields in the query, something like date > (>2000|ascending) for search results with dates on or after > 2000-01-01, sorted in ascending (chronological) order, vs date > (>2000|descending) with reverse chronological order (newest > first). Another usage could be author(|asc) where there's no > filter criteria for author, but the results will be sorted in > ascending order by author. A search could be made case-sensitive > as regards a particular field (but not the other search fields) > using keyword(whatever|cs). > > At (1 page 117) is an example interface that would allow the user > to tweak the relevance algorithm. In that interface the user picks > an Importance (Low, Medium, or High) for each of several aspects: > finding all the search terms, the number of times a search time > appears on the page, how early in the text a search term appears, > whether the term appears in the title, the proximity of the search > terms to each other, and whether the terms appear in the order > given. This would give Peter what he was describing, and takes the > openness that Jimbo wants one step further: not only can the user > tell how the relevance was determined, but they can also change it > if they have a reason to. > > (1 page 121) points out the need for what I'm calling a Query > Help Desk. Sure, we're going to have lots of human input into the > database, but what if all that still doesn't get the user what they > are really looking for? 
Then we can take the human contribution to > its ultimate conclusion: give the user a way to submit a > description of what they are looking for, and how it differs from > what they got, to a real (group of) human(s) who can assist in > building the right query. Yeah, this has a potential to generate a > HUGE FLOOD, but hey, no one thought we could keep up with vandalism > on Wikipedia either, but we do a pretty good job in general (at > least I think so). > > (1 pages 125 to 129) present "Search Zones". What I took away > from this, is it would be useful, in addition to being able to > filter for keywords and author/date/title and so on, to have > categorization and be able to narrow the results by category. > > Some resources mentioned in (2) that I found interesting: Google > Zeitgeist, searchenginewatch.com, currentwisdom.com, > SearchEngineBulletin.com. > > That's all for now (and I'm sure plenty, as I'm lucky if even 2 > people are still reading this far). > > Thank you, > > Charm > > ----- Original Message ---- > From: Jimmy Wales > To: peter burden > Cc: search-l@wikia.com > Sent: Wednesday, January 3, 2007 5:51:34 PM > Subject: Re: [Search-l] First steps to getting it right... > > peter burden wrote: >> By allowing search users to specify their own ranking weightings (and >> possibly algorithms) >> at search time would provide a further opportunity for the >> community to >> experiment and >> feed back their experiences. > > Yes, although realistically this should be available upon request, > rather than being the default of course. Most of the time when I > search, I just want to search, I don't want to fiddle with algorithms. > But absolutely, having the possibility for people who are into > experimentation to be able to do so is great. :) > >> I do not think community voting on links etc., is a good idea. I >> can't >> see it scaling. 
Community >> input should be directed towards tuning and developing ranking >> algorithms and spam detection >> mechanisms. > > Just to be clear, I generally agree with this. I do think having a > way > for ordinary users, who are not prepared to get involved heavily, > to add > their knowledge to the mix is crucial. In my model, these "thumbs > up / > thumbs down" things play a secondary and advisory role to the > community. > >> My main concern relates to the scale of the whole enterprise. Are we >> intending to outdo you know who (as well as Yahoo and MSN)? If so >> a long careful look at the >> numbers is called for. > > Eh, I dunno. What I would anticipate here is that with all open > source > algorithms and software, as well as open indexing (i.e. making all the > data available freely as well), we will see a ton of people borrowing > our technology for all kinds of purposes. That's what a healthy free > culture ecosystem should look like, I think. :) > > --Jimbo > _______________________________________________ > Search-l mailing list > Search-l@wikia.com > http://lists.wikia.com/mailman/listinfo/search-l > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > _______________________________________________ > Search-l mailing list > Search-l@wikia.com > http://lists.wikia.com/mailman/listinfo/search-l Carpe Diem, Evanndro Paes dos Reis evanndro.reis@arvos.com.br arvos, a nova geração do software http://www.arvos.com.br -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070104/01fc29c8/attachment-0001.html From renaud at oslutions.com Thu Jan 4 12:46:32 2007 From: renaud at oslutions.com (Renaud Richardet) Date: Thu Jan 4 12:53:30 2007 Subject: [Search-l] First steps to getting it right...
In-Reply-To: <459C50DF.5020700@wikia.com> References: <459C1A03.9030708@wikia.com> <355a36af0701031649t7b226673oa338472be3456e7@mail.gmail.com> <459C50DF.5020700@wikia.com> Message-ID: <459CF728.9090405@oslutions.com> Jimmy Wales wrote: > Aerik Sylvan wrote: >> I think you hit a lot of key points with your note, but I think the >> gentleman (sorry, I forget your name) who asked about the value >> proposition asked a key one as well. Since people are on this list, >> or looking at the wiki, they must be bought into the community idea. >> But I think discussing the value proposition - is a key ingredient to >> forming > > (Aerik wrote a nice summary of search quality as a value proposition.) > > To add to this, let me throw out what emotionally rings more strongly > with me as a value proposition here: search is a fundamental part of > the infrastructure of the net, and as with all the other fundamental > parts of the infrastructure of the net, it needs to be *transparent*, > it needs to be *open*, we need to be able to look under the hood (as a > society) and see what is what, and why things are ranked as they are. *Transparency* of results was one of the motivations for Doug Cutting to create Nutch. See for example http://www.erzsuche.de/en/search.html (which is one of the public search websites powered by Nutch, see a list at http://wiki.apache.org/nutch/PublicServers), and search for "ferien". On the result page, you can click on "explain", and you might get something like the text at the end of this message (try http://www.erzsuche.de/explain.jsp?idx=0&id=51808&query=ferien&lang=de). OK, this explanation page is still very geeky -- but basically, Nutch explains to you why it returned this page, and which factors influenced the ranking. We could easily enhance this page, and allow people to modify the factors and the ranking.
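To make the "modify the factors" idea concrete, here is a minimal sketch that recomputes the score from the explain output quoted at the end of this message. The formula is inferred from those numbers (each field contributes queryWeight times fieldWeight, with tf being the square root of the term frequency); the names field_score, explain_scores, FIELDS and QUERY_NORM are hypothetical and are not part of Nutch or Lucene.

```python
import math

# Values copied from the explain output for the query "ferien", doc 51808.
QUERY_NORM = 0.021845266
FIELDS = {
    "url":     dict(term_freq=1,  idf=8.790039,  boost=4.0, field_norm=0.125),
    "anchor":  dict(term_freq=23, idf=9.146714,  boost=2.0, field_norm=0.21875),
    "content": dict(term_freq=3,  idf=4.8646064, boost=1.0, field_norm=0.0390625),
    "title":   dict(term_freq=1,  idf=8.694729,  boost=1.5, field_norm=0.375),
    "host":    dict(term_freq=1,  idf=9.094528,  boost=2.0, field_norm=0.5),
}


def field_score(term_freq, idf, boost, field_norm, query_norm):
    """Score of one field: queryWeight * fieldWeight, as in the explain tree."""
    query_weight = boost * idf * query_norm
    field_weight = math.sqrt(term_freq) * idf * field_norm
    return query_weight * field_weight


def explain_scores(fields, query_norm):
    """Return a {field name: score} breakdown; the total is the sum of the values."""
    return {name: field_score(query_norm=query_norm, **f)
            for name, f in fields.items()}


if __name__ == "__main__":
    breakdown = explain_scores(FIELDS, QUERY_NORM)
    for name, score in breakdown.items():
        print(f"{name:8s} {score:.7f}")
    print("total   ", round(sum(breakdown.values()), 6))
```

A user-facing version of the explain page could let people change the per-field boost values (4.0 for url, 2.0 for anchor, and so on) and show how the total moves, since the boosts are the only inputs here that are a ranking choice rather than a property of the index.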
*Openness* of search results: Nutch conforms to the opensearch standard (http://www.opensearch.org/Home), which allows other applications to use Nutch "as an API" and integrate Nutch into their application. This way, you could have a front-end user interface in PHP that calls the opensearch interface and parses the search results. HTH, Renaud

page
* segment = 20070101140520
* digest = 6425ee6ad7f2e578560c8bbeb117098a
* lastModified = 1153410057000
* contentLength = 10841
* primaryType = text
* subType = html
* url = http://www.ferien-osterzgebirge.de/
* title = Ferien im Erzgebirge Ferienwohnungen in Altenberg und Informationen zu der Region
* boost = 1.2476859

score for query: ferien
* 7.449376 = sum of:
  o 0.8439349 = weight(url:ferien^4.0 in 51808), product of:
    + 0.768083 = queryWeight(url:ferien^4.0), product of:
      # 4.0 = boost
      # 8.790039 = idf(docFreq=79)
      # 0.021845266 = queryNorm
    + 1.0987549 = fieldWeight(url:ferien in 51808), product of:
      # 1.0 = tf(termFreq(url:ferien)=1)
      # 8.790039 = idf(docFreq=79)
      # 0.125 = fieldNorm(field=url, doc=51808)
  o 3.834684 = weight(anchor:ferien^2.0 in 51808), product of:
    + 0.39962482 = queryWeight(anchor:ferien^2.0), product of:
      # 2.0 = boost
      # 9.146714 = idf(docFreq=55)
      # 0.021845266 = queryNorm
    + 9.59571 = fieldWeight(anchor:ferien in 51808), product of:
      # 4.7958317 = tf(termFreq(anchor:ferien)=23)
      # 9.146714 = idf(docFreq=55)
      # 0.21875 = fieldNorm(field=anchor, doc=51808)
  o 0.03497626 = weight(content:ferien in 51808), product of:
    + 0.10626862 = queryWeight(content:ferien), product of:
      # 4.8646064 = idf(docFreq=4053)
      # 0.021845266 = queryNorm
    + 0.32913065 = fieldWeight(content:ferien in 51808), product of:
      # 1.7320508 = tf(termFreq(content:ferien)=3)
      # 4.8646064 = idf(docFreq=4053)
      # 0.0390625 = fieldNorm(field=content, doc=51808)
  o 0.9289492 = weight(title:ferien^1.5 in 51808), product of:
    + 0.284908 = queryWeight(title:ferien^1.5), product of:
      # 1.5 = boost
      # 8.694729 = idf(docFreq=87)
      # 0.021845266 = queryNorm
    + 3.2605233 = fieldWeight(title:ferien in 51808), product of:
      # 1.0 = tf(termFreq(title:ferien)=1)
      # 8.694729 = idf(docFreq=87)
      # 0.375 = fieldNorm(field=title, doc=51808)
  o 1.8068316 = weight(host:ferien^2.0 in 51808), product of:
    + 0.39734477 = queryWeight(host:ferien^2.0), product of:
      # 2.0 = boost
      # 9.094528 = idf(docFreq=58)
      # 0.021845266 = queryNorm
    + 4.547264 = fieldWeight(host:ferien in 51808), product of:
      # 1.0 = tf(termFreq(host:ferien)=1)
      # 9.094528 = idf(docFreq=58)
      # 0.5 = fieldNorm(field=host, doc=51808)

>
> Here's something excellent from Blake Ross of Firefox fame:
> http://www.blakeross.com/2006/12/25/google-tips/
>
> What's great about Firefox? Well, first of all, it is just a *better
> browser*... that's like Aerik's value proposition of search quality.
> But at a deeper level what excites me about Firefox is that it is
> *free software*.
>
> And as such, it is the foundation for all kinds of interesting
> experimentation that will push the net forward... (Flock browser for
> example).
>
> Free knowledge, a free world, requires an open transparent and free
> (in the sense of GNU) infrastructure.
>
> I want us to help build that.
>
> --Jimbo
>
> p.s. Nutch and Lucene are critical building blocks, and so perhaps are
> some of the other projects that have been mentioned here.

-- renaud richardet +1 617 230 9112 renaud oslutions.com http://www.oslutions.com -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070104/f4dbce8a/attachment.html From erikmednis at gmail.com Thu Jan 4 06:26:15 2007 From: erikmednis at gmail.com (Erik Mednis) Date: Thu Jan 4 13:07:17 2007 Subject: [Search-l] Re: Core functions, Requirements, Spam, A Wish-List...(Re: Search-l Digest, Vol 2, Issue 3) In-Reply-To: <48fa644c0701032217k736b1fcbs8cfce9d43cf95ddb@mail.gmail.com> References: <48fa644c0701032217k736b1fcbs8cfce9d43cf95ddb@mail.gmail.com> Message-ID: <48fa644c0701032226w3beaf9a4x5dfb68c0152ef7ed@mail.gmail.com> Hi All... I'd be interested in seeing a side-thread devoted to the nay-sayer commentary, specifically, what are/were the reasons behind why 'it can't be done'... Or at least a digest available for viewing...I find those discordant views to be both a great source of motivation and an occasional inspiration in solving that thing that can't be done...so long as they dont get too distracting... Concur with Aerik et. al, we're best served by seeing some consensus on what the real problem to solve is (beyond transparency in the algorithm).... Where's the Requirements Document ?! :) I look forward to Jimmy's outline of a vision, but from a Creative professional and former Search Engine Marketing exec's point of view, I'd throw these into my wish-list: *Once indexed, rankings would not be limited to occasional 'crawls' or periodic tweaks in the algorithm...the famous 'Google-dance' sure is very dramatic, and it keeps the SEOs in business, but the overall effect is quite disruptive unless its Open and Publicly announced in advance. The search engine would use a well documented algorithm to generate a master index---a starting point and initial taxonomy, and then use a combination of user ratings and actual click-through data to dynamically rank results. Queries or results with sudden upswings in click-through would make fine candidates to be tossed into the 'Editorial-Review' Queue. 
I assume we're already there on this, but can cookies be used to create a 'trusted click' system ? *It should also be able to parse requests and results in folksonomies/tags as well. I would argue that if you can generate enough critical mass, Folksonomies will outperform Taxonomies in relevance, every time. The Wiki community is pretty uniquely situated to be able to make this kind of traffic happen. Finding a way to effectively combine the two is obviously tantalizing. *It might consider the relationship between the syntax of the users' queries and the [potentially relevant] existing tag base in how it orders results. Is the query 'speaking in Tags' ? Deliver Tags-based results...or speaking in plain speech ? A shocking number of search queries are those plain language questions, where someone is literally asking the 'Mystic Oracle' a real question. Can it have linguistic filters incorporated into results, [attempting] to tailor more relevant results to 'Why', 'Where', 'Who', etc. ? Perhaps this is a layer of the 'voting' or 'editing' process and suggests a higher level of tags ? *It should have the ability to display results in multiple formats or modes. (eg - options for visual hierarchies, trees, tag clouds, 3d visualizations, lists, etc.) Google and Yahoo index listings are not very "usable" - any Google SERP heat map will show like a 'Golden Triangle', where the majority of clicks occur in exactly the same place regardless of the quality or nature of the results. *Users should be able to opt-up or opt-down the privacy safeguards, and potentially _share_ their recent searches... For an extreme example, I might be perfectly fine with a browser plug-in or even a remote service that considers my search history or the collective interests of my social network when delivering results, _IF_ I was confident that it was secure and going to deliver the very best result, spam/splog/spuncontent-free. 
On the other hand, there may be times when I want to search in privacy, and that should be dead simple to do. *A completely open API and archival of all search queries... [Yes, Privacy Issue] An Open, Unsponsored 'Zeitgeist'-like function with a potential data set this large is price-less. * And finally, it should be channel agnostic but contextually aware from Day 1 (Eg, your USER-AGENT is from a mobile device, it serves Mobile friendly content, etc - ) I hate to throw things over the wall and run, but those are a few quick thoughts I had to share.. I expect we'll have MUCH more to discuss when Jimmy shares his ideas on the core design. And Quite Lastly, Randy Wilson wrote: "I think that there is substantially more unique and valuable information than spam on the web, and that valuable information is not limited to a few wiki*.org domains." This is important. I believe spam may indeed outnumber real content soon, if not already... Automated Splog generators and Splog networks are bad enough, but just do a search on 'Web Spinning' or 'Content Spinning' and you'll see how bad the situation is about to really become, and why the success of this project is so important. "The hardest thing about eating an elephant may be figuring out where to start. Right now it seems like everyone is standing around looking hungry, and no-one even has a fork, much less a knife, or barbeque grill." And new Metaphor, please...getting hungry Best, -Erik Mednis On Jan 3, 2007, at 7:23 PM, search-l-request@wikia.com wrote: Send Search-l mailing list submissions to search-l@wikia.com To subscribe or unsubscribe via the World Wide Web, visit http://lists.wikia.com/mailman/listinfo/search-l or, via email, send a message with subject or body 'help' to search-l-request@wikia.com You can reach the person managing the list at search-l-owner@wikia.com When replying, please edit your Subject line so it is more specific than "Re: Contents of Search-l digest..." Today's Topics: 1. 
[Suggestion] Partition the email channel (William Surowiec) 2. Interesting presentation on Lucene, Nutch (Aerik Sylvan) 3. Care (Paul Charles Leddy) 4. First steps to getting it right... (Jimmy Wales) 5. Relationship of Community and Search Engine (Randy Wilson) 6. Re: Relationship of Community and Search Engine (Jimmy Wales) 7. Re: Relationship of Community and Search Engine ( thomasasta@gmx.net) 8. Re: Relationship of Community and Search Engine (Aerik Sylvan) ---------------------------------------------------------------------- Message: 1 Date: Tue, 02 Jan 2007 14:46:58 -0500 From: William Surowiec Subject: [Search-l] [Suggestion] Partition the email channel To: Search-l@wikia.com Message-ID: <459AB6B2.9000107@gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed I sense we are a community of many skills. And, somewhat like in the tale of the blind men encountering an elephant for the first time, we are describing in meaningful detail the facets we, through the lens of our experience, can immediately recognize. I believe this has great value (aside from the postings themselves, I find many of the links offered just delicious.) But we are using just one channel to discuss all of the elephant at once. The channel may become cacophonous and hard to sort out (somewhat like a noisy party.) My suggestion is we partition it into a number of sub channels (but still keep one email list.) My sense is we have been discussing at least the following issues in our postings: target related, software/architecture related, introducing ourselves, links to potentially interesting and relevant material, business concerns, the ever present miscellaneous, etc Can we come up with a classification of our topics and add the classification to the subject line? I have taken the liberty of placing such a tag on this email - the syntax and the meta tag are illustrative only. My objective is to facilitate both the initial reading and subsequent finding of postings. 
Now, instead of 754 people discussing this point - how about we vote with our fingers (I come from Brooklyn NY, we have some rather explicit and ubiquitous ones for up close, in person, use) and individually begin to invent and use a tagging system (if we agree) and allow a consensus to potentially emerge. Bill ------------------------------ Message: 2 Date: Wed, 3 Jan 2007 11:50:49 -0800 From: "Aerik Sylvan" < aerik@thesylvans.com> Subject: [Search-l] Interesting presentation on Lucene, Nutch To: search-l@wikia.com Message-ID: < 355a36af0701031150r5119fb6ej8a3a931f6be33eeb@mail.gmail.com> Content-Type: text/plain; charset="iso-8859-1" http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070103/7c4ec90c/attachment-0001.html ------------------------------ Message: 3 Date: Wed, 3 Jan 2007 12:47:25 -0800 From: "Paul Charles Leddy" < pcleddy@gmail.com> Subject: [Search-l] Care To: Search-l@wikia.com Message-ID: < d0ad9b7e0701031247l544cb606r9515a73d487c3d32@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Something to learn from wikipedia and apply to this human-driven search "engine": Articles in wikipedia are only as good as the group of people that currently CARE about them. As the community of those who give-a-care changes, the article can swing this way or that. There is a possibility the Talk pages show this history together with the history. ------------------------------ Message: 4 Date: Wed, 03 Jan 2007 16:02:59 -0500 From: Jimmy Wales Subject: [Search-l] First steps to getting it right... To: search-l@wikia.com Message-ID: <459C1A03.9030708@wikia.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed One of the things that I believe in passionately is genuine human communities, as opposed to "crowdsourcing". What do I mean by that? 
I mean, people who get to know each other, over time, as real human beings, and through that process, gain a sense of trust and responsibility for each other and for the task at hand. So for me, if we are to succeed here, this is the first place we need to focus attention... This project is different from a traditional wiki, in which people are coming to know each other through the process of writing. On Wikipedia, on Uncyclopedia, on world.wikia.com , on thousands of other wikis around the web, people write, rewrite, debate on the talk page, and come to some mutual understanding and trust. The process is messy, but that's because the process is human, and genuinely human processes are going to be messy and complex. I think we need to design for 3 levels of process here, because of the differences that I see between what we are trying to do and what a normal wiki is trying to do. First, there is the spider and the ranking algorithms. These need to be public and transparent. They need to be controllable by the community. But the very nature of this machine-centric process means that a lot of the heavy lifting here will be on the level of *code*, i.e., this is a community of developers. Second, there is the possibility and necessity of massive feedback from the general public. This can be in the form of voting, thumbs up / thumbs down on links, digg-style pages, etc. My own view is that these things should be simple, intuitive, and the means by which we encourage people to get more involved. HOWEVER, because these are precisely the kinds of mechanisms that can be "gamed" by spammers and other ne'er-do-wells, they have to be treated with a significant amount of caution and be public, transparent, and controllable by the community. 
And finally, the bit that I think makes it all work: the space for the core community of users, users who are not necessarily programmers, but who can make serious editorial judgments in a neutral way through the process of open discussion and debate. ===== I can explain each of these levels and what I see as the reason for them, drawing on my experiences with search in the past, my experiences with wikis, my observations and experiences with dmoz, etc. But that can come out over time. For now I just want to point out that the largest amount of skepticism about what we are going to try to accomplish here is driven by the inherent issue of spammers. There are huge incentives for people to try to abuse our good will and we have to anticipate and expect that. But, unlike many of the skeptics who think that this is impossible, I am very confident that if we can build a genuine community and give ourselves as a community the tools we need, then we can deal with this issue without a lot of trouble. Tomorrow I will write more about how I see the core design working. --Jimbo ------------------------------ Message: 5 Date: Wed, 3 Jan 2007 16:36:59 -0600 From: "Randy Wilson" < xixtas@gmail.com> Subject: [Search-l] Relationship of Community and Search Engine To: search-l@wikia.com Message-ID: < 10f6add00701031436j2bb4a717oa02c6c0e977fa0a@mail.gmail.com> Content-Type: text/plain; charset="iso-8859-1" Hello. I am an IT manager and entrepreneur in real life and an editor at the Open Directory Project and also at Wikibooks and Wikipedia. I don't understand the vision. I guess one fundamental thing that would help me understand is to describe how the community will interact with the engine. Is this a project just for developers, or is there a role for content reviewers and classifiers as well? Will content reviewers mainly be focused on blacklisting and whitelisting? Disambiguation pages to discern user intent is not a new concept in search engines. 
Will this project include a classification component? The statement that non-spam is "more finite" seems troublesome to me, because though it is undoubtedly true (accepting for the moment that there are levels of finiteness), it implies that much of what is out there to be classified can be ignored. I think that there is substantially more unique and valuable information than spam on the web, and that valuable information is not limited to a few wiki*.org domains. The hardest thing about eating an elephant may be figuring out where to start. Right now it seems like everyone is standing around looking hungry, and no-one even has a fork, much less a knife, or barbeque grill. ------------------------------ Message: 6 Date: Wed, 03 Jan 2007 17:55:23 -0500 From: Jimmy Wales Subject: Re: [Search-l] Relationship of Community and Search Engine To: Randy Wilson <xixtas@gmail.com> Cc: search-l@wikia.com Message-ID: <459C345B.8010605@wikia.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Randy Wilson wrote: I don't understand the vision. I guess one fundamental thing that would help me understand is to describe how the community will interact with the engine. Is this a project just for developers, or is there a role for content reviewers and classifiers as well? Will content reviewers mainly be focused on blacklisting and whitelisting? My design includes room for content reviewers and classifiers. Some of that will be focused on blacklisting and whitelisting, but I think the community can and should do a lot more than that. Open dialog and discussion is incredibly powerful for getting thoughtful results in any process... and this will be encouraged. Disambiguation pages to discern user intent is not a new concept in search engines. Will this project include a classification component?
The statement that non-spam is "more finite" seems troublesome to me, because though it is undoubtedly true (accepting for the moment that there are levels of finiteness), it implies that much of what is out there to be classified can be ignored. I think that there is substantially more unique and valuable information than spam on the web, and that valuable information is not limited to a few wiki*.org domains. Yes, there is a great deal of valuable information out there, and it is not limited to a few domains. The hardest thing about eating an elephant may be figuring out where to start. Right now it seems like everyone is standing around looking hungry, and no-one even has a fork, much less a knife, or barbeque grill. That's ok. The wiki world is messy. We eat with our hands. ;-) --Jimbo ------------------------------ Message: 7 Date: Thu, 04 Jan 2007 01:04:10 +0100 From: thomasasta@gmx.net Subject: Re: [Search-l] Relationship of Community and Search Engine To: Jimmy Wales , xixtas@gmail.com Cc: search-l@wikia.com Message-ID: <20070104000410.118560@gmx.net> Content-Type: text/plain; charset="iso-8859-1" Hi, good question and ideas. Users should be able to insert pages into the search index, rate URLs/pages, and add comments to each page. This is a simple concept: - one peer/user, one vote for one URL/page - scale: ++ / + / - / -- Furthermore, a comment for that page URL could be written. We could store it all in a database. Yacy.net, if I may come back to this advanced tool, already has the rating of URLs as well as the concept of storing such comments in the DHT database. So anyone can not only submit URLs for indexing; each peer can also index them itself with its local node, and each surfer/node user can rate each viewed page in a toolbar.
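The rating concept described above (one vote per peer per URL, a four-step scale, plus an optional comment) can be sketched roughly as follows. This is an illustration only: the numeric weights, the `RatingStore` class, and its method names are assumptions made for the example and are not taken from yacy or nutch.

```python
from collections import defaultdict

# Assumed numeric weights for the four-step scale; the message names
# the steps (++ / + / - / --) but not their values.
SCALE = {"++": 2, "+": 1, "-": -1, "--": -2}


class RatingStore:
    """Hypothetical in-memory store holding one vote per (peer, URL)."""

    def __init__(self):
        # url -> {peer_id: (weight, comment)}
        self._votes = defaultdict(dict)

    def rate(self, peer_id, url, mark, comment=""):
        """Record a peer's rating; a repeat vote replaces the earlier one."""
        if mark not in SCALE:
            raise ValueError(f"unknown rating {mark!r}")
        self._votes[url][peer_id] = (SCALE[mark], comment)

    def score(self, url):
        """Sum of the weights of all peers' current votes for this URL."""
        return sum(weight for weight, _ in self._votes[url].values())

    def comments(self, url):
        """All non-empty comments left for this URL."""
        return [text for _, text in self._votes[url].values() if text]
```

Keying the votes by peer means a second rating from the same peer overwrites the first, which is one simple way to enforce "one peer, one vote"; it does nothing against a spammer who controls many peers.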
Furthermore, yacy lets you create bookmarks, and I can make them public, so an index of suggested favourite sites is created in the p2p network. This function does not involve a central machine for looking up the published bookmarks of the nearest peers, so it is quite a good concept. And you ask what can be done? If you are a coder, help code yacy in Java. If you know XUL or C++, then let's focus on the toolbar for yacy; there is a need for an Internet Explorer toolbar, because the existing one is for Mozilla. If you cannot code, then just run a yacy peer and try the search experience. Play around with the system and learn all the features you can use. Everyone who is not installing a test machine (of whichever search engine) should unsubscribe from the list or send feedback to improve the mentioned applications, nutch or yacy. So: install something, code something, and give us feedback. Furthermore, we are all interested in the paper Jimmy announced, which should lead to a decision between nutch and yacy. But I guess it is sensible to still have a period of discussion, to get ideas and votes out of the community; the open question of a joint venture between the coding teams and the Wikimedia Foundation also has to be settled. So I hope we get an answer after some of the next meetings of the Wikimedia Foundation. But really... Jim is just waiting for the servers to be sent... ;-)) Kind regards ------------------------------ Message: 8 Date: Wed, 3 Jan 2007 16:22:57 -0800 From: "Aerik Sylvan" Subject: Re: [Search-l] Relationship of Community and Search Engine To: search-l@wikia.com Message-ID: <355a36af0701031622o244994c7w60fa71d3ba8cb49f@mail.gmail.com> Content-Type: text/plain; charset="iso-8859-1" On 1/3/07, Randy Wilson wrote: I don't understand the vision.
I guess one fundamental thing that would help me understand is to describe how the community will interact with the engine. Is this a project just for developers, or is there a role for content reviewers and classifiers as well? (etc...) The hardest thing about eating an elephant may be figuring out where to start. Right now it seems like everyone is standing around looking hungry, and no-one even has a fork, much less a knife, or barbeque grill. Randy, I think that's kind of the point: This is a grand vision, and there are no details (or very few?) that are agreed upon yet. I think the only certainty is a vision that a search engine can have more / different community involvement than what is generally out there now, and this is a desired outcome. It's a big vision. Instead of asking Jimmy, or everyone else, what we're doing, I think at this stage it is appropriate to say how you think the vision should look. So, yeah, everyone is standing around hungry. But we don't know that it's an elephant; I think we're still figuring out what we're hunting for. What do you want to hunt for? Aerik ------------------------------ _______________________________________________ Search-l mailing list Search-l@wikia.com http://lists.wikia.com/mailman/listinfo/search-l End of Search-l Digest, Vol 2, Issue 3 ************************************** From gevangasteren at gmx.net Thu Jan 4 16:09:41 2007 From: gevangasteren at gmx.net ("Gé van Gasteren") Date: Thu Jan 4 16:09:44 2007 Subject: [Search-l] Ranking/rating and avoiding spam Message-ID: <20070104160941.57770@gmx.net> I'm just a layman in this field, but here are two thoughts about 'rating' that occurred to me while reading some of the posts: - Instead of just rating a page or site 'good' or 'bad', human users can do much more. They can label it with keywords (preferably selected from a list, so the number remains limited; the list could be tree-structured), and they can rate it in many dimensions, like: amount of info, pleasant to read/view/navigate, fun/boring, compact/wordy, easy/difficult to read, thoroughness, common/rare info, mainstream/controversial, minimum reader age, speed/hard on PC & browser, etc. Sure, the number of rating categories should be limited to keep it practical and fun, but with a rating GUI like the one on Google Video it's really easy to do. - An idea to (help) prevent ratings by bots: When a user subscribes, he/she receives a 'rating quota', which means that he/she can rate a certain number of pages and sites. After that, he/she sends in a request to 'refuel', which triggers some checking routines on the server side. Ge' From seb at schmoller.net Thu Jan 4 19:30:09 2007 From: seb at schmoller.net (Seb Schmoller) Date: Thu Jan 4 19:37:06 2007 Subject: [Search-l] The sites people bookmark In-Reply-To: <48fa644c0701032226w3beaf9a4x5dfb68c0152ef7ed@mail.gmail.com> References: <48fa644c0701032217k736b1fcbs8cfce9d43cf95ddb@mail.gmail.com> <48fa644c0701032226w3beaf9a4x5dfb68c0152ef7ed@mail.gmail.com> Message-ID: <459D55C1.6010000@schmoller.net> I'm writing as a searcher, not as a technical person.
Erik Mednis said: > This is important. I believe spam may indeed outnumber real content > soon, if not already... Automated Splog generators and Splog networks > are bad enough, but just do a search on 'Web Spinning' or 'Content > Spinning' and you'll see how bad the situation is about to really > become, and why the success of this project is so important. I may have missed this point being made by someone else (or it may be totally unfeasible). The sites that people bookmark either locally or in connotea, digg, or delicious, or similar, which they come across in various ways - search, blogs, following links, etc. - must, as a subset of all sites, have greater than average utility. The same must be true of the sites used in customised searches using Google or Rollyo, say. If a way could be found for WikiaSearch to concentrate its crawling on such human-selected sites, and to give added weight to sites that many had bookmarked (provided "spam bookmarking" could be detected), would that not more or less do away with the problem of spam web sites in search results? Seb Schmoller -- Phone: +44 (0)114 2586899 Fax: +44 (0)709 2208443 Address: 312 Albert Road, Sheffield S8 9RD, UK Web site: http://www.schmoller.net/ Blog (a.k.a.
"Fortnightly Mailing"): http://fm.schmoller.net/ -- From jwales at wikia.com Thu Jan 4 19:44:34 2007 From: jwales at wikia.com (Jimmy Wales) Date: Thu Jan 4 19:45:01 2007 Subject: [Search-l] The sites people bookmark In-Reply-To: <459D55C1.6010000@schmoller.net> References: <48fa644c0701032217k736b1fcbs8cfce9d43cf95ddb@mail.gmail.com> <48fa644c0701032226w3beaf9a4x5dfb68c0152ef7ed@mail.gmail.com> <459D55C1.6010000@schmoller.net> Message-ID: <459D5922.1080107@wikia.com> Seb Schmoller wrote: > If a way could be found for WikiaSearch to concentrate its crawling on > such human-selected sites, and to give added weight to sites that many > had bookmarked (provided "spam bookmarking" could be detected), would > that not more or less do away with the problem of spam web sites in > search results? Yes, I think this is right. I advocate something like a "whitelisting" spider, with the core community controlling the crawl. I don't know exactly what role bookmarking might play in that... there are some interesting problems with bookmarking (user data privacy, etc.)... but the general concept of whitelisting makes sense to me when there is a large community to make sure the comprehensiveness is there. If you spider everything and then rely on the community to weed out the junk, that's a problem from two perspectives: First, the community has to look at a lot of junk to get rid of it, rather than looking at good stuff to include it. Not very fun. Second, if you indiscriminately include everything all the time, then spammers just keep evolving and evolving, forcing you to throw them out over and over and over again.
--Jimbo From bora_98 at yahoo.com Fri Jan 5 01:09:18 2007 From: bora_98 at yahoo.com (Borislav Agapiev) Date: Fri Jan 5 01:16:03 2007 Subject: [Search-l] The sites people bookmark Message-ID: <20070105010918.78463.qmail@web61218.mail.yahoo.com> The "whitelisting" approach is a noble goal, and it certainly addresses the issue of the overwhelming amounts of junk; however, I think we should give some numbers to put things in perspective. For a quick background, I founded a search company and we are crawling the entire Internet regularly. First, if we assume we have any kind of "crawler", that assumes that new sites to crawl are added AUTOMATICALLY, and then we again have the problem of junk sites/domains popping in there anyway. So that means that the inclusion of new sites to crawl has to be controlled by the community. Now even that will be a very big issue, since there will be hundreds of millions of (sub)domains, i.e. abc.def.com ... which is still a very significant effort for a community of users; e.g. with 10,000 users, each would be in charge of tens of thousands of (sub)domains alone and all links under them. Basically there is so much stuff out there (tens and actually hundreds of billions of pages) and so much of it is junk (very roughly half porn and half of the rest spam) that one pretty much has to use automation to try to approach any sense of comprehensiveness. On the other hand, it is always possible to use very tight "whitelisting", i.e. we let in only the sites we are absolutely sure of, but then it will be a while before we cover the entire Web :-) In this case the effort becomes something like del.icio.us, where we include only what people have decided to include and nothing else, but I do not believe that is the goal here, or is it? My proposal would be for an automated ranking algorithm with user-driven ranking parameters.
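The per-user workload estimate given earlier in this message is plain division, and can be checked with a few lines. The totals below are placeholders chosen only to span the "hundreds of millions" order of magnitude mentioned in the message; they are not measured figures, and the function name is made up for this sketch.

```python
def domains_per_reviewer(total_domains: int, reviewers: int) -> int:
    """Return how many (sub)domains each reviewer must cover, rounded up."""
    if reviewers <= 0:
        raise ValueError("reviewers must be positive")
    # Ceiling division without going through floating point.
    return -(-total_domains // reviewers)


if __name__ == "__main__":
    reviewers = 10_000  # community size used in the message
    # Placeholder totals spanning "hundreds of millions" of (sub)domains.
    for total in (100_000_000, 300_000_000, 900_000_000):
        share = domains_per_reviewer(total, reviewers)
        print(f"{total:,} (sub)domains / {reviewers:,} reviewers = {share:,} each")
```

Any total in that range leaves each of 10,000 reviewers with between ten thousand and ninety thousand (sub)domains, which is consistent with the "tens of thousands" figure in the message.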
Basically, there is no way around it: because there is so much junk out there AND we do not want to expose this junk to users, there has to be some automation to handle it. Borislav Agapiev From aerik at thesylvans.com Fri Jan 5 01:43:50 2007 From: aerik at thesylvans.com (Aerik Sylvan) Date: Fri Jan 5 01:43:52 2007 Subject: [Search-l] The sites people bookmark In-Reply-To: <20070105010918.78463.qmail@web61218.mail.yahoo.com> References: <20070105010918.78463.qmail@web61218.mail.yahoo.com> Message-ID: <355a36af0701041743ycbc5f8bo4b8d34098260da20@mail.gmail.com> On 1/4/07, Jimmy Wales wrote: > > Yes, I think this is right. I advocate something like a "whitelisting" > spider, with the core community controlling the crawl. (etc.) > > If you spider everything and then rely on the community to weed out the > junk, that's a problem from two perspectives: > > First, the community has to look at a lot of junk to get rid of it, > rather than looking at good stuff to include it. Not very fun. > I don't know - I think any solution where we rely on the community to add content will fail several tests of a "better search engine". Here's where I'm coming from: I have been running a wiki-based directory (general like Dmoz, but structured like Wikipedia) for a couple of years now (Jimmy, you thought I'd get to be known as a porn site, but I've kept it family friendly and of reasonable quality!). It's working okay - I'm keeping the blatant spammers at bay and dozens of sites that have at least some value are being listed daily. But mostly they have mediocre value. They are there because the webmaster/SEO put them there. Any solution that requires manually added sites will suffer from this kind of "watering down". You may keep out the porn and splogs, but we'll be overwhelmed with mediocrity. (I'm condemning my own business model a little here, and I've got very mixed feelings about that!) Think about what urls got listed (and by whom) in Wikia 2 years ago. I think if we are going to build a search engine, we should go find the pages!
Filter them for relevance to the best of our ability, and then let our community do the rest. Have you ever used Craigslist? Seen their system for flagging bad entries? http://sfbay.craigslist.org/about/help/flags_and_community_moderation I'm picturing a search engine that takes input from the community. We can say "yeah, this search result is a pretty good match for my query" (thumbs up, one vote, whatever), but I'm also picturing a system to vote down spam ("flags"). (Disclosure: I'm working on a combination tagging/flagging social bookmarking service right now.) Thoughts? Aerik From Keith at Botley.net Fri Jan 5 05:03:41 2007 From: Keith at Botley.net (Keith Botley) Date: Fri Jan 5 05:03:48 2007 Subject: [Search-l] The sites people bookmark and advanced search options In-Reply-To: <355a36af0701041743ycbc5f8bo4b8d34098260da20@mail.gmail.com> Message-ID: <00c201c73086$ded65d90$b900a8c0@Botley.local> > I think if we are going to build a search engine, we should go find the pages! Filter them for relevance to the best of our ability, and then let our community do the > rest I think so, if we are looking for mass acceptance from the start. On day one we are the same as any other algorithm-based search engine, but whereas they hit a threshold in search relevancy, based on the finiteness of their software, we never hit a cap (or at least I can't think of one). I would think that the major search engines already track click-throughs associated with the string the user entered, so they would be assigning relevancy based on the amount of content (text) they display to the user in the search results. Based on this, we would be adding a second level of non-automated filtering, based on human abstract thought, to the search results through users' direct feedback.
(Trying to imagine a workflow here) So eventually the garbage falls out the bottom or gets pulled out immediately by user feedback. So we list then white list then WHITE LIST. A real world e