From jmcc at hackwatch.com Wed Aug 1 02:34:26 2007
From: jmcc at hackwatch.com (John McCormac)
Date: Wed, 01 Aug 2007 03:34:26 +0100
Subject: [Search-l] Grub Update
In-Reply-To: <46AFE229.1020409@wikia.com>
References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com>
Message-ID: <46AFF132.20400@hackwatch.com>

Jimmy Wales wrote:
> One of the first jobs for the OS version of the client is to make absolutely 100% sure that it behaves itself exquisitely well, both for the clients and for the sites being crawled.

Unfortunately it is not a question of the behaviour of Grub or any other crawler. The owners of large directories and sites tend to be far more aggressive now in protecting their resources. That means that many of them are tired of scrapers and bots and will block anything outside the Google/Yahoo/Microsoft crawlers. Others have blocked entire countries at IP level or by extension.

> I think this misunderstands how Grub works. Grub distributes the crawling and checking to see if sites have changed, it does not distribute the decisionmaking about which sites to crawl. In this sense it is much more like Seti@Home than like Gnutella networks or the like. It is "distributed" not "peer to peer".

Again this runs into the "shoot on sight" attitude of some webmasters. The crawler will be seen as coming from dynamic/dialup IP ranges, many of which are already iffy due to scrapers. With the main search engines, the IPs have proper reverse DNS so that webmasters can be certain that they are who they claim to be.

> And YES you are 100% right - crawling is only a piece of the search solution. In theory a distributed crawler can spider the web more quickly and thoroughly than a centralized solution. And another part of the theory here is that by reducing the *cost* of a high quality crawl, it becomes possible to make the *results* of the crawl available under a free license. (Which, of course, Wikia will do no matter what the cost, because that's the whole point of what we are doing here.)

In June, I spidered the index pages from all active .eu websites from a tracking dataset of .eu domains (approx 1.436M websites out of 1.78M actively resolving domains from a list of 2.13M .eu domains). The aim was to create some estimate of how many active .eu websites there were. The results were quite startling - only about 16.13% of the domains with websites (roughly 19.90% of the websites) were actively developed. The data was then broken down over active websites, parked sites, holding pages, frame src redirects etc. A similar first run on .mobi had only 10% of the websites actively developed and that was before any dupe and holding page algorithms were applied to the data.

The problem with building a good index is that this kind of work is never really seen or heard about. The enthusiasts tend to think that they know how search engines work and, to a certain extent, they do. But they do not appreciate what goes into creating and maintaining a high quality search index. This process has to be highly automated to be successful, as handling millions of websites is not something that can be done efficiently by hand.

The reason that most of these mini search engines fail after eighteen months or so is that they run into the brick wall of the acquisition problem. (Similar to that of the web directories that rely on user submissions.) They have to compete with search engines like Google that are far better equipped, and URL detection is not the most efficient way of detecting new sites. Many new sites are not linked. It often takes some time for the linkbacks to appear in directories. And since Google has the greatest footprint, the site owners will often submit them to Google. This gives Google a major head start on the dwindling number of active web directories.
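The .eu figures quoted earlier in this message can be cross-checked with a little arithmetic. This is only a sketch in Python, not part of the original analysis: it assumes that the 16.13% figure is measured against the 1.78M resolving domains and the 19.90% figure against the 1.436M websites, and the constant and function names are made up for illustration. Under that reading, both percentages should describe roughly the same number of actively developed sites.

```python
# Rounded counts as quoted in the message; the names are illustrative only.
LISTED_DOMAINS = 2_130_000      # domains in the tracking list
RESOLVING_DOMAINS = 1_780_000   # domains that actively resolve
WEBSITES = 1_436_000            # resolving domains that serve a website


def developed_estimates():
    """Estimate the number of actively developed sites in two ways.

    Assumption: 16.13% applies to resolving domains and 19.90% to websites.
    """
    from_domains = 0.1613 * RESOLVING_DOMAINS
    from_websites = 0.1990 * WEBSITES
    return from_domains, from_websites


from_domains, from_websites = developed_estimates()

# How far apart the two estimates are, relative to the first one.
relative_gap = abs(from_domains - from_websites) / from_domains

# Actively developed sites as a share of every domain in the list.
share_of_list = from_domains / LISTED_DOMAINS

print(f"developed sites (from domains):  {from_domains:,.0f}")
print(f"developed sites (from websites): {from_websites:,.0f}")
print(f"relative gap: {relative_gap:.2%}")
print(f"share of all listed domains: {share_of_list:.2%}")
```

If that reading of the percentages is right, the two estimates agree to within about one percent, which is what rounded input counts would be expected to give.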
The cost of a high quality crawl is probably an order of magnitude or so lower than those estimates that have been published. Most of the ones I've read fail to take into consideration the numbers of duplicate, PPC, holding pages and assorted junk in an extension. This is the stuff that is removed in the pre-index process. They extrapolate the number of domains to the number of websites and work from there. The reality is that the webspace of most extensions is like a large, bumpy plain with a handful of skyscrapers and a lot of small tents. The interesting thing is that the ccTLDs tend to be different to the TLDs like .com etc. The Irish .ie extension had an active development figure of approximately 57%. I haven't worked out a figure for .uk yet but I would expect it to be somewhat higher than that of .com or .eu.

Most of the work in a high quality crawl actually goes into building a high quality index as its starting point. It is then a process of continual refinement. This is why I tend to wonder about distributed search when there is no corresponding thought being put into the critical question of "searching for what?".

Regards...jmcc
--
******************************************************
John McCormac   * e-mail: jmcc at whoisireland.com
MC2             * voice:  +353-51-873640
22 Viewmount    * web:    http://www.whoisireland.com/
Waterford       * blog:   http://blog.whoisireland.com
Ireland         * Irish Domain Stats & Market Research
******************************************************

From peter.burden at gmail.com Wed Aug 1 00:13:41 2007
From: peter.burden at gmail.com (peter burden)
Date: Wed, 01 Aug 2007 01:13:41 +0100
Subject: [Search-l] big announcement: grub is back!
In-Reply-To:
References:
Message-ID: <46AFD035.805@gmail.com>

jer wrote:
> I'm sending this as Jimmy is on stage announcing it in his keynote here at OSCON, telling how Wikia has been working with LookSmart to acquire the Grub project and open-source it again.
>
> A lot of work has gone into this process and I want to thank Jimmy, Gil, and some of the thoughtful folks at LookSmart.
>
> Grub was open a long time ago, and it is now again. We're going to be working hard over the coming weeks to get the codebase up on a repository and keep the service running in a testing mode so people can start to play with it, so keep an eye on grub.org and its wiki page, http://search.wikia.com/wiki/Grub.

Just downloaded and it looks interesting. Nice to see the project moving ahead and as soon as I get it running it'll be interesting to see how it copes with some of the odder sites on the WWW. Noted the use of C++ and SOAP. Just wondering if the same (or similar) effect could be achieved using "wget", a shell script and a simple compressing/parsing back end?

> On the "big vision" map, this is just one project, a distributed crawler, and all of the contents and results it will compile will be fully available under an open document license (yay!).
>
> There is more coming along these lines, more projects, this looks like an exciting trend :)
>
> Jer
>
> _______________________________________________
> Search-l mailing list
> Search-l at wikia.com
> http://lists.wikia.com/mailman/listinfo/search-l
> Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l

From seth.ford at gmail.com Wed Aug 1 19:22:03 2007
From: seth.ford at gmail.com (Seth Ford)
Date: Wed, 1 Aug 2007 13:22:03 -0600
Subject: [Search-l] Grub Update
In-Reply-To: <46AFF132.20400@hackwatch.com>
References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com>
Message-ID:

I couldn't agree more.... if you look at the amount of time Google, Yahoo and Microsoft spend working with websites to improve indexing and the groups they have working on improving results to queries through hired communities, I am not sure there is room to play.
It would seem more reasonable just to mash in an engine when you need one and let the community take on the short tail. Almost like an open About.com approach just opening up to anyone... maybe use wikia.com as the knowledge store to mine (although I don't like the break from UI consistency when you're trying to organize the entire web).

I bet I could project out how Google would do it... Move Google docs over to a global wiki with a consistent UI, allow people to create an article on anything (nuggets, pointers where stuff can really be found.... whatever) and then index it and create a new presence that allows you to look there first before turning to a web crawl. Pretty easy to envision the interface and see how powerful it would be if you could get people to organize information not into an encyclopedia but into a match of every query to an article that acts as a real pointer. From there it's just a matter of writing a great trust API....

Why try to reinvent the wheel when it really doesn't need to be? With enough community involvement over time a web crawl would become less and less relevant.

Just my two cents....

Seth

From peter.burden at gmail.com Wed Aug 1 23:42:06 2007
From: peter.burden at gmail.com (peter burden)
Date: Thu, 02 Aug 2007 00:42:06 +0100
Subject: [Search-l] Grub Update
In-Reply-To: <46AFF132.20400@hackwatch.com>
References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com>
Message-ID: <46B11A4E.7080903@gmail.com>

John McCormac wrote:
> Jimmy Wales wrote:
>
>> One of the first jobs for the OS version of the client is to make absolutely 100% sure that it behaves itself exquisitely well, both for the clients and for the sites being crawled.

In which case 'grub' has a long way to go. As far as I can tell from the Microsoft Visual C++ code there is no support for robot exclusion. "robots.txt" is mentioned in a "todo" list and there's a function that recognises "robots.txt" in a URL, but the function doesn't appear to be called anywhere. There's no mention of exclusion via the tag attributes.
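For readers who have not implemented robot exclusion, the check being asked for is small. The following is a minimal sketch using Python's standard `urllib.robotparser`, not anything taken from the Grub code base. The user agent string `grub-client`, the example rules and the helper name `allowed` are all assumptions made for illustration, and the rules are parsed from a string so that the example needs no network access. It covers robots.txt only, not exclusion through page markup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one group for an assumed crawler name,
# and a catch-all group that shuts every other robot out.
EXAMPLE_ROBOTS = """\
User-agent: grub-client
Disallow: /private/

User-agent: *
Disallow: /
""".splitlines()


def allowed(robots_lines, user_agent, url):
    """Return True if robots_lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already held in memory
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("grub-client", "http://example.org/index.html"),
        ("grub-client", "http://example.org/private/report.html"),
        ("otherbot", "http://example.org/index.html"),
    ]:
        print(agent, url, allowed(EXAMPLE_ROBOTS, agent, url))
```

A real crawler would fetch each site's robots.txt once, cache the parsed rules per host, and consult them before every request to that host.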
The use of the "if-modified-since" HTTP request header is hinted at in a "todo" list but the code doesn't seem to take advantage of this. I've no idea how it controls per-server traffic, possibly it relies on the "random" selection of sites to "spread the load".

Crawling "at random" seems to me a bad idea for a variety of reasons. If the randomness implies random URLs on random sites then, in order to be well behaved, the crawler needs to fetch the "robots.txt" file for each site prior to fetching the actual URL, creating a significant extra network and server overhead. There will also be overheads from DNS lookups. [Those who have written crawlers that I have looked at seem to have found that DNS can represent a significant bottleneck.] Randomness also prevents the use of cookies as a strategy to crawl dynamic sites.

>> And YES you are 100% right - crawling is only a piece of the search solution. In theory a distributed crawler can spider the web more quickly and thoroughly than a centralized solution. And another part of the theory here is that by reducing the *cost* of a high quality crawl, it becomes possible to make the *results* of the crawl available under a free license. (Which, of course, Wikia will do no matter what the cost, because that's the whole point of what we are doing here.)

I don't think this theory is necessarily right. If the crawl targets have to be distributed from a central controlling server and the results sent back then the traffic level on the central server is going to be of the same order of magnitude as if the central machine crawled directly. The total network traffic will be greater as the crawled data has to make two network journeys (one from remote site to crawler, one from crawler to central machine).
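The two-journeys argument can be put into rough numbers. This is a back-of-envelope sketch, not a measurement of Grub: the function name, the `forward_ratio` parameter and the 10 GB crawl size are invented for illustration, and the default case assumes every fetched byte is forwarded to the central server uncompressed.

```python
def crawl_traffic(crawled_bytes, distributed, forward_ratio=1.0):
    """Return (bytes arriving at the central server, total bytes on the network).

    forward_ratio is the fraction of the fetched data that the remote
    crawlers send on to the central server (1.0 means everything,
    uncompressed; a smaller value models compression or filtering).
    """
    if not distributed:
        # The central machine fetches every page itself: one journey.
        return crawled_bytes, crawled_bytes
    # Journey 1: remote site -> crawler. Journey 2: crawler -> central server.
    forwarded = crawled_bytes * forward_ratio
    return forwarded, crawled_bytes + forwarded


if __name__ == "__main__":
    ten_gb = 10 * 1024 ** 3
    print("central crawl:     ", crawl_traffic(ten_gb, distributed=False))
    print("distributed crawl: ", crawl_traffic(ten_gb, distributed=True))
    print("with 5:1 reduction:", crawl_traffic(ten_gb, distributed=True, forward_ratio=0.2))
```

With everything forwarded, the central server receives the same volume as if it had crawled directly and the network as a whole carries twice as much. Only a `forward_ratio` well below 1.0, for example from compression or parsing on the client, changes that picture for the central server.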
If you don't feed all the results back to a central machine then there's lots of extra network traffic as various factories, brokers and collectors talk to each other to try and find documents (or references to documents) that satisfy search criteria. There's also likely to be duplication of server accesses. If the traffic from crawler to/from central machine is encoded in some form as inefficient as XML then the situation is even worse.

> In June, I spidered the index pages from all active .eu websites from a tracking dataset of .eu domains (approx 1.436M websites out of 1.78M actively resolving domains from a list of 2.13M .eu domains). The aim was to create some estimate of how many active .eu websites there were. The results were quite startling - only about 16.13% of the domains with websites (roughly 19.90% of the websites) were actively developed. The data was then broken down over active websites, parked sites, holding pages, frame src redirects etc. A similar first run on .mobi had only

And that's only some of the problems. ;-)

> 10% of the websites actively developed and that was before any dupe and holding page algorithms were applied to the data.
>
> The problem with building a good index is that this kind of work is never really seen or heard about. The enthusiasts tend to think that they know how search engines work and, to a certain extent, they do. But they do not appreciate what goes into creating and maintaining a high quality search index. This process has to be highly automated to be successful as handling millions of websites is not something that can be done efficiently by hand.
>
> The reason that most of these mini search engines fail after eighteen months or so is because they run into the brick wall of the acquisition problem. (Similar to that of the web directories that rely on user submissions.) They have to compete with search engines like Google that are far better equipped and URL detection is not the most efficient way of detecting new sites. Many new sites are not linked. It often takes some time for the linkbacks to appear in directories. And since Google has the greatest footprint, the site owners will often submit them to Google. This gives Google a major head start on the dwindling number of active web directories.

Well I'm glad somebody else has spotted this. If you don't believe it, note down a few URLs off vans, lorries, buses, yellow pages, local shops etc., and then do an advanced Google search for pages that link to the domain home page. I managed to get a zone transfer of ".org.uk" some time ago and did this check - as I recall I was seeing ~20-30% of active web sites having no incoming links according to Google.

> Most of the work in a high quality crawl actually goes into building a high quality index as its starting point. It is then a process of continual refinement. This is why I tend to wonder about distributed search when there is no corresponding thought being put into the critical question of "searching for what?".

In my experience an equally significant effort is required for setting up and tweaking filters to reject unwanted and irrelevant documents and avoiding any one of several >>interesting<< spider traps.

> Regards...jmcc

From jmcc at hackwatch.com Thu Aug 2 04:06:55 2007
From: jmcc at hackwatch.com (John McCormac)
Date: Thu, 02 Aug 2007 05:06:55 +0100
Subject: [Search-l] Grub Update
In-Reply-To: <46B11A4E.7080903@gmail.com>
References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <46B11A4E.7080903@gmail.com>
Message-ID: <46B1585F.4030408@hackwatch.com>

peter burden wrote:
> John McCormac wrote:
>
> And that's only some of the problems. ;-)

The .eu is a disaster zone thanks to the incompetence of EURid and the European Commission.
People in Europe are concentrating on the ccTLDs and the .com TLD as their primary business brands and ignoring .eu in droves. (But I've already written about this elsewhere ;) and have even been quoted in Wired.) I think that the main, high quality growth over the next few years will be in the ccTLDs. That is one of Google's main vulnerabilities, and it is one battlefield on which a new search engine can defeat Google and the other large search engines.

> and did this check - as I recall I was seeing ~20-30% of active web sites having no incoming links according to Google.

Extrapolating that to the main part of .uk (.co.uk) and the ccTLDs would explain some of the performance of Google (and of the other main search engines) in ccTLD search. Even the TLDs would have millions of websites that the conventional search engines miss.

> In my experience an equally significant effort is required for setting up and tweaking filters to reject unwanted and irrelevant documents and avoiding any one of several >>interesting<< spider traps.

Running a good search engine is an on-going task. The more I look at this project, the more I wonder if it is just an idea without a real plan. Perhaps it is just at too early a stage and people are still caught up in the buzz of a new venture. I don't know if the acquisition of Grub involved cash. But if it did, it may have been a case of Wikia having had more money than sense. It will need a major overhaul to make it useful.

Regards...jmcc

From sethf at sethf.com Thu Aug 2 04:23:42 2007
From: sethf at sethf.com (Seth Finkelstein)
Date: Thu, 2 Aug 2007 00:23:42 -0400
Subject: [Search-l] Grub Update
In-Reply-To: <46AFF132.20400@hackwatch.com>
References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com>
Message-ID: <20070802042342.GA27063@sethf.com>

I've been keeping my head down on this list since the last brouhaha, but I find myself in the odd position of being the most "optimistic" about Wikia Search in the set of people who aren't just worshipfully echoing the hype. So ...

On Wed, Aug 01, 2007 at 03:34:26AM +0100, John McCormac wrote:
> The reason that most of these mini search engines fail after eighteen months or so is because they run into the brick wall of the acquisition problem. (Similar to that of the web directories that rely on user submissions.)

I suggest there's a path that Wikia Search doesn't have to really be a "Google-killer" in order to be very profitable. All of that is PR and sales-pitch, but it's not necessary for "success" as defined in investment terms. That is, if Wikia ends up with a search engine that's superior on the topics of computer hardware, comics, anime, science fiction, Star Trek, Star Wars, and porn (reflecting the interests of the demographic which will likely be contributing intensively ...), though awful on everything else, that's still probably worth a lot of money in targeted advertising sales.
Read up on "wikigroaning" to see what I mean: http://www.wordspy.com/words/Wikigroaning.asp

Also, keep in mind that there's a whole network of digital-sharecropping electronic plantations, excuse me, I mean Wikia "community sites", which can be used *both ways* to eventually support Wikia Search - as recruiting material for free workers and data to build the high-quality index, and as a partner market for users of the search engine. I don't think that *Wikipedia* itself would be used that way, due to legal requirements, but all the commercial Wikia sites are another story.

So URL detection is partially solved by data-mining the Wikia sites, or relying on people *in that group* to know of new good sites for that *particular area*. Think of it as effectively a vertical search for "geek topics". That's a much smaller domain than Google.

What's so unusual here is that most small search startups begin with technology, then try to get an audience. But this project is the reverse, beginning with marketing, and trying to have the *audience* build the technology and everything else.

--
Seth Finkelstein  Consulting Programmer  http://sethf.com/
Infothought blog - http://sethf.com/infothought/blog/
Interview: http://sethf.com/essays/major/greplaw-interview.php

From jmcc at hackwatch.com Thu Aug 2 07:48:37 2007
From: jmcc at hackwatch.com (John McCormac)
Date: Thu, 02 Aug 2007 08:48:37 +0100
Subject: [Search-l] Grub Update
Message-ID: <46B18C55.4000308@hackwatch.com>

(Replying to Seth Finkelstein's post which hasn't turned up here yet.)

"I suggest there's a path that Wikia Search doesn't have to really be a "Google-killer" in order to be very profitable. All of that is PR and sales-pitch, but it's not necessary for "success" as defined in investment terms. That is, if Wikia ends up with a search engine that's superior on the topics of computer hardware, comics, anime, science fiction, Star Trek, Star Wars, and porn (reflecting the interests of the demographic which will likely be contributing intensively ...), though awful on everything else, that's still probably worth a lot of money in targeted advertising sales."

That is Wikia's strength - a large group of micro-search engines rather than one huge, generic search engine like Google etc. The investment and profitability angle might be iffy.

"So URL detection is partially solved by data-mining the Wikia sites, or relying on people *in that group* to know of new good sites for that *particular area*"

Mining Wikipedia or Wikia sites will only produce a very limited, though relatively high quality, set of sites. It will not solve the problem of near real-time website acquisition. It is really just repeating the process of using the Dmoz dump as a seed but with a dump of URLs from Wikipedia etc.

"What's so unusual here is that most small search startups begin with technology, then try to get an audience. But this project is the reverse, beginning with marketing, and trying to have the *audience* build the technology and everything else."

I've seen search engine development by press release before. :) Most of those ventures never even made it to launch. Those that launched crashed ignominiously after failing to get an audience. Having the audience build the technology is somewhat innovative. Not so much an "if you build it they will come" approach but rather a "you build it and bring your own beer" one.

Open Sourcing the resultant data is a nice idea. It just isn't a killer app (to use the old dot.bomb terminology). The costs of bandwidth and storage have fallen but the search engine expertise pool has been heavily fished by the big search engines. Those that don't work for the larger search engines run their own small search engines. The search engine business is highly competitive - more so than it appears. While the whole "wisdom of crowds" thing is great, it would help if the crowd in question had more wisdom about the search business.

Regards...jmcc

From wsurowiec at gmail.com Thu Aug 2 13:53:10 2007
From: wsurowiec at gmail.com (William Surowiec)
Date: Thu, 02 Aug 2007 09:53:10 -0400
Subject: [Search-l] interesting papers at Banff 2007
Message-ID: <46B1E1C6.1060003@gmail.com>

http://www2007.org/prog-Papers.php

From jeremie at jabber.org Thu Aug 2 18:01:51 2007
From: jeremie at jabber.org (jer)
Date: Thu, 2 Aug 2007 13:01:51 -0500
Subject: [Search-l] Grub Update
In-Reply-To: <46AFCBCD.8010502@hackwatch.com>
References: <46AFCBCD.8010502@hackwatch.com>
Message-ID: <64E967B5-60AF-43BF-95E6-BDE8412BE65A@jabber.org>

> I'm a bit new to this wikia search thing but the concept of using Grub is a bit confusing. It is almost an implementation of the Infinite Number of Monkeys approach to spidering the web. It still requires a powerful backend to make sense of all the data spidered and that was always Grub's flaw.

Correct, this is just one portion of a search platform, this is entirely the content side. There's still deep/big things to be done after this, but you can't get to there without having a strong content foundation first.

> The state of the web has changed since Grub was a player. Most of the larger sites now block spidering by DSL and dialup connections.

Most of the larger sites are easy to crawl as well.

> Some directories block on User Agent and from what I remember, Grub was one string that used to get blocked a lot.
Yep, this needs to be addressed to make sure Grub is behaving properly. It being an open source project now will obviously make any naughty behaviour at least a lot more transparent.

> This is the other flaw with a Grub approach - there is no quality assurance of the index. Many small search engines have followed the Infinite Monkeys approach to indexing, following each URL to find more. The problem with this approach is that it relies on the back end to give the data context. They tend to last about 18 months on average.

There's another side to Grub that we've not talked about at all, and that's very much related to creating and increasing the quality of the work that Grub is doing. The web interface and user management in Grub right now is pretty straightforward, and lacks the tools you'd need to monitor and increase the quality of the crawling.

We need to start a big discussion about a mash-up of Grub and a wiki, where the wiki is serving as the primary driver and dashboard of all activities. This Grub+wiki would be the first step for a social framework to manage an open crawling/content foundation for search.

The wiki would have some fields to help direct crawlers, including blocking, timing, depth, discovery, frequency, etc. The Grub results would be correlated back into the wiki so that crawl samples are easily checked and problems can be discovered quickly.

All of the intelligence in the wiki and all of the crawl output will be available under an open doc license, and then begins the next stage, building out a platform for bulk access and processing, *grin*.

> It should be interesting to see how things turn out.

Indeed :)

Jer

From jeremie at jabber.org Thu Aug 2 18:21:26 2007
From: jeremie at jabber.org (jer)
Date: Thu, 2 Aug 2007 13:21:26 -0500
Subject: [Search-l] Grub Update
In-Reply-To: <46AFF132.20400@hackwatch.com>
References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com>
Message-ID: <4B78A6C2-A0C5-4614-B76B-CFE3E5A8B0DF@jabber.org>

> The problem with building a good index is that this kind of work is never really seen or heard about. The enthusiasts tend to think that they know how search engines work and, to a certain extent, they do. But they do not appreciate what goes into creating and maintaining a high quality search index. This process has to be highly automated to be successful as handling millions of websites is not something that can be done efficiently by hand.

I've been through the same process and very much want to point out something: It's *both* high automation *and* human oversight/tweaking. Exactly what you did, and I did, and everyone who's done any amount of serious crawling has to do, is add a crap-ton of human intelligence to the massive automation process, with constant feedback as things jut out here and there.

This should be shared and open, we all shouldn't have to be doing this independently. It's part of Jimmy's vision to have an open wiki serve as the social gathering and common ground for the human side of this crawling equation. I very much believe in it as a great way to move the whole search industry beyond everyone having to re-do all this work.

> The reason that most of these mini search engines fail after eighteen months or so is because they run into the brick wall of the acquisition problem. (Similar to that of the web directories that rely on user submissions.) They have to compete with search engines like Google that are far better equipped and URL detection is not the most efficient way of detecting new sites. Many new sites are not linked.
It often takes > some time for the linkbacks to appear in directories. And since Google > has the greatest footprint, the site owners will often submit them to > Google. This gives Google a major head start on the dwindling > number of > active web directories. Perhaps we should all be working together and sharing resources, so that any value or uniqueness that a new "mini" engine might add isn't lost in the noise of duplicating all the other effort to get there. > The cost of a high quality crawl is probably a magnitude or so lower > than those estimates that have been published. Most of the ones I've > read fail to take into consideration the numbers of duplicate, PPC, > holding pages and assorted junk in an extension. This is the stuff > that > is removed in the pre-index process. They extrapolate the number of > domains to the number of websites and work from there. The reality is > that the webspace of most extensions is like a large, bumpy plain with > a handful of skyscrapers and a lot of small tents. The interesting > thing is that the ccTLDs tend to be different to the TLDs like .com > etc. > The Irish .ie extension had an active development figure of > approximately 57%. I haven't worked out a figure for .uk yet but I > would > expect it to be somewhat higher than that of .com or .eu. Thanks for sharing what you're learning, I wish everyone was this open about their own discoveries even if only informally or in discussions like this. > Most of the work in a high quality crawl actually goes into building a > high quality index as its starting point. It is then a process of > continual refinement. This is why I tend to wonder about distributed > search when there is no corresponding thought being put into the > critical question of "searching for what?". 
I'm more of a platform guy, and want to build a great foundation to let anyone and everyone answer the "searching for what" question, building common methods for feedback to ensure quality is happening for everyone equally, not a specific type of search application. Jer From jeremie at jabber.org Thu Aug 2 18:39:41 2007 From: jeremie at jabber.org (jer) Date: Thu, 2 Aug 2007 13:39:41 -0500 Subject: [Search-l] Grub Update In-Reply-To: <46B11A4E.7080903@gmail.com> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <46B11A4E.7080903@gmail.com> Message-ID: <5533B20B-A211-4621-ADED-11B615F2E905@jabber.org> > In which case 'grub' has a long way to go. As far as I can tell > from the > Microsoft Visual C++ > code there is no support for robot exclusion. "robots.txt" is > mentioned > in a "todo" > list and there's a function that recognises "robots.txt" in a URL, but > the function > dosen't appear to be called anywhere. I believe there's some server side function involved in managing the robots processing too, but I'm still trying to learn the entire thing and can't really be authoritative yet... maybe I can see if Igor will jump in on this thread :) > There's no mention of exclusion > via the > tag attributes. The use of the "if-modified-since" HTTP > request is > hinted at in a "todo" list but the code doesn't seem to take advantage > of this. > I've no idea how it controls per-server traffic, possibly it relies on > the "random" > selection of sites to "spread the load". I believe you're right. It also doesn't take advantage of keepalives. > Crawling "at random" seems to me a bad idea for a variety of > reasons. If the > randomness implies random URLs on random sites than, in order to be > well > behaved the crawler needs to fetch the "robots.txt" file for each site > prior to > fetching the actual URL creating a significant extra network and > server > overhead. > There will also be overheads from DNS lookups. 
[Those who have written > crawlers that I have looked at seem to have found that DNS can > represent > a significant bottleneck.] > > Randomness also prevents the use of cookies as a strategy to crawl > dynamic sites. *nod*, all agreed. > I don't think this theory is necessarily right. If the crawl targets > have to be distributed > from a central controlling server and the results sent back then the > traffic level on the > central server is going to be of the same order of magnitude as if the > central machine > crawled directly. The total network traffic will be greater as the > crawled data has to > make two network journeys (one from remote site to crawler, one from > crawler to > central machine). While I don't want to start a whole thread about this particular point, I mildly disagree on the most basic part, in that a bunch of pages packed together and compressed is a lot easier to just stream/ dump onto a completely dumb file server, the model isn't complete duplication even at the purest level. Where I really start to believe in the distributed crawling is when the clients get more intelligent, recognizing 404 pages, junk pages, spider traps, common patterns (parked pages), and so on. Secondarily, they can do some "lightweight" indexing, breaking out the links, titles, etc and provide them in a structured form along with the compressed package back to the big storage area. IMO it would be nice if they could perform normalization on the content too, but that's much more questionable and won't become a thread until there's a big repository and people trying to work with it. > In my experience an equally significant effort is required for setting > up and > tweaking filters to reject unwanted and irrelevant documents and > avoiding > any one of several >>interesting<< spider traps. Agreed, I'd love to see this knowledge be part of the public commons. 
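As a rough illustration of the "lightweight" indexing idea above (breaking out the links and titles on the client and shipping them back in structured form), here is a minimal sketch using only the Python standard library. It is not taken from the Grub code; the names `LightIndexer` and `light_index` and the shape of the returned record are assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LightIndexer(HTMLParser):
    """Collects the page title and outgoing links from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def light_index(base_url, html):
    """Return a small structured record (hypothetical format) for one page."""
    parser = LightIndexer(base_url)
    parser.feed(html)
    parser.close()
    return {"url": base_url, "title": parser.title.strip(), "links": parser.links}
```

A client could send a record like this alongside the compressed raw page, so the central store still holds the original data for fault finding.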
Jer From jeremie at jabber.org Thu Aug 2 19:05:52 2007 From: jeremie at jabber.org (jer) Date: Thu, 2 Aug 2007 14:05:52 -0500 Subject: [Search-l] interesting papers at Banff 2007 In-Reply-To: <46B1E1C6.1060003@gmail.com> References: <46B1E1C6.1060003@gmail.com> Message-ID: <3B5ADE61-25D1-4205-9BB7-9768BC782D83@jabber.org> Some excellent stuff there, thanks for the link... I've never heard of "Isotonic Smoothing" so I already learned something from the titles alone :) Jer On Aug 2, 2007, at 8:53 AM, William Surowiec wrote: > http://www2007.org/prog-Papers.php > > _______________________________________________ > Search-l mailing list > Search-l at wikia.com > http://lists.wikia.com/mailman/listinfo/search-l > Change options or unsubscribe: http://lists.wikia.com/mailman/ > options/search-l From jmcc at hackwatch.com Thu Aug 2 21:18:05 2007 From: jmcc at hackwatch.com (John McCormac) Date: Thu, 02 Aug 2007 22:18:05 +0100 Subject: [Search-l] Grub Update In-Reply-To: <4B78A6C2-A0C5-4614-B76B-CFE3E5A8B0DF@jabber.org> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <4B78A6C2-A0C5-4614-B76B-CFE3E5A8B0DF@jabber.org> Message-ID: <46B24A0D.3000606@hackwatch.com> jer wrote: > I've been through the same process and very much want to point out > something: It's *both* high automation *and* human oversight/ tweaking. > Exactly what you did, and I did, and everyone who's done any amount of > serious crawling has to do, is add a crap-ton of human intelligence to > the massive automation process, with constant feedback as things jut > out here and there. Yes Jer, but you don't know what I've done and vice versa. So I don't know if we have been through quite the same process. :) The thing about search on this or any other significant scale is that it requires a completely different mindset to that required for building a web directory or wiki where each entry can be individually validated. 
I don't think that adding a 'crap-ton' of human intelligence to the process is an accurate description of what happens. The indications that mark a site for deletion tend to be clear and it is the speed at which this happens that is important. Sometimes, this has to be applied to every website on an IP or even on the same DNS. It is a very anti-democratic process. Some are easy wins: linkswamps that can be identified by a DNS or IP, PPC that can be identified from a particular string, duplicate content pages that all have the same MD5 hash, etc. The hard part is when it goes beyond the easy wins to the stuff that requires a human decision. > This should be shared and open, we all shouldn't have to be doing this > independently. It's part of Jimmy's vision to have an open wiki serve > as the social gathering and common ground for the human side of this > crawling equation. I very much believe in it as a great way to move > the whole search industry beyond everyone having to re-do all this work. Some of us (those lucky enough to survive in the search engine wars) have been doing this kind of work independently for years. We do talk to each other but there is a slight attitude of "better him than me" when some other search engine venture goes dot.bomb. Some of the techniques and methodology of search engine development are closely held - none more closely than a good search index. The tools for building search engines are widely available (Nutch etc). It is the human element of the equation that is in short supply. Many on the second and third tiers (those below GYM (Google/Yahoo/Microsoft)) of the search business have been talking on internet fora and lists for years. Having spent years developing a good search index, many of these people would not particularly want to give up such an edge. Though the wiki idea is nice, the mindset is somewhat different to that of Wikipedia and the whole "Cathedral and the Bazaar" model.
Most search engine developers are too busy trying to survive without having to subscribe to some happy-clappy ethos that could very well put them out of business. These are the guys who you will have to convince that there is some value to being involved in the Wikia search project. > Perhaps we should all be working together and sharing resources, so > that any value or uniqueness that a new "mini" engine might add isn't > lost in the noise of duplicating all the other effort to get there. That's all very laudable but this is a business. The small search engines are not going to hand over their survival edge to Jimmy's vision, which is essentially that of a competitor who will take their work and monetise it. That is the road block that the project has to get beyond. > I'm more of a platform guy, and want to build a great foundation to let > anyone and everyone answer the "searching for what" question, building > common methods for feedback to ensure quality is happening for everyone > equally, not a specific type of search application. But without that essential spark of the search engine developers, there is a danger that the project could just be another platform - much like Amazon's search and servers product. Being a search engine developer is not the same as being a web developer. There is a lot more thinking and learning involved. Most thinking is about the "searching for what" question. It defines the nature of the search engine being developed. It makes the search engine a macro search engine or a niche engine. It makes the difference between success and failure. Having a platform for open search is nice. It might attract some search engine developers. Having a real search idea to go with that platform is better. Is Wikia search just an open platform without an idea for a search application?
Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From jmcc at hackwatch.com Thu Aug 2 21:56:21 2007 From: jmcc at hackwatch.com (John McCormac) Date: Thu, 02 Aug 2007 22:56:21 +0100 Subject: [Search-l] Grub Update In-Reply-To: <5533B20B-A211-4621-ADED-11B615F2E905@jabber.org> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <46B11A4E.7080903@gmail.com> <5533B20B-A211-4621-ADED-11B615F2E905@jabber.org> Message-ID: <46B25305.7020601@hackwatch.com> jer wrote: > Where I really start to believe in the distributed crawling is when the > clients get more intelligent, recognizing 404 pages, junk pages, spider > traps, common patterns (parked pages), and so on. This is the danger of confusing the function of crawlers with that of the search backend. The key to a fast and efficient crawl is that the crawler is streamlined and handles as many pages as possible in as short a time as possible. Breaking out to parse html is processor intensive and slows down crawling considerably. 
Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From peter.burden at gmail.com Thu Aug 2 23:33:29 2007 From: peter.burden at gmail.com (peter burden) Date: Fri, 03 Aug 2007 00:33:29 +0100 Subject: [Search-l] Grub Update In-Reply-To: <46B25305.7020601@hackwatch.com> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <46B11A4E.7080903@gmail.com> <5533B20B-A211-4621-ADED-11B615F2E905@jabber.org> <46B25305.7020601@hackwatch.com> Message-ID: <46B269C9.8070206@gmail.com> John McCormac wrote: > jer wrote: > >> Where I really start to believe in the distributed crawling is when the >> clients get more intelligent, recognizing 404 pages, junk pages, spider >> traps, common patterns (parked pages), and so on. >> > > This is the danger of confusing the function of crawlers with that of > the search backend. The key to a fast and efficient crawl is that the > crawler is streamlined and handles as many pages as possible in as short > a time as possible. Breaking out to parse html is processor intensive > and slows down crawling considerably. > This may be the key to fast crawling but I don't think it is the key to efficient crawling. Efficient crawling requires attention to all the points Jer mentions and many more, most of them require parsing the HTML and performing various analyses. I'd rather have an efficient polite crawler than a fast crawler - I'd really like a crawler that was both of course ;-) Fetching and parsing both have to be done, they could be done on separate machines but this does impose an overhead of extra network traffic. I suppose it ultimately depends on how distribution is actually organised. 
In my (limited) experience parsing HTML is not particularly CPU intensive - what is a pain is checking for duplicate pages, alias host names, detecting various spider traps, deciding whether a URL represents a page already fetched and maintaining the various data structures. > Regards...jmcc > From jwales at wikia.com Fri Aug 3 00:28:26 2007 From: jwales at wikia.com (Jimmy Wales) Date: Fri, 03 Aug 2007 08:28:26 +0800 Subject: [Search-l] Grub Update In-Reply-To: <46B269C9.8070206@gmail.com> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <46B11A4E.7080903@gmail.com> <5533B20B-A211-4621-ADED-11B615F2E905@jabber.org> <46B25305.7020601@hackwatch.com> <46B269C9.8070206@gmail.com> Message-ID: <46B276AA.7010607@wikia.com> peter burden wrote: > This may be the key to fast crawling but I don't think it is the key to > efficient > crawling. Efficient crawling requires attention to all the points Jer > mentions > and many more, most of them require parsing the HTML and performing > various analyses. I'd rather have an efficient polite crawler than a > fast crawler - I'd really like a crawler that was both of course ;-) And this is the real potential strength of a distributed approach, I think. With a small number of crawling machines, you perhaps have to fetch fetch fetch fetch without a lot of "thinking". But with 10,000 or 100,000 or 1,000,000 machines pitching in? I am not sure what the best architecture for that will end up being -- that's an empirical question and I don't think we have enough experience yet, any of us, to really know the answer. So, we move forward and learn. :) > In my (limited) experience parsing HTML is not particularly CPU > intensive - what is a pain is checking for duplicate pages, alias > host names, detecting various spider traps, deciding whether a > URL represents a page already fetched and maintaining > the various data structures. 
*nod* --Jimbo From jmcc at hackwatch.com Fri Aug 3 00:41:45 2007 From: jmcc at hackwatch.com (John McCormac) Date: Fri, 03 Aug 2007 01:41:45 +0100 Subject: [Search-l] Grub Update In-Reply-To: <46B269C9.8070206@gmail.com> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <46B11A4E.7080903@gmail.com> <5533B20B-A211-4621-ADED-11B615F2E905@jabber.org> <46B25305.7020601@hackwatch.com> <46B269C9.8070206@gmail.com> Message-ID: <46B279C9.9040508@hackwatch.com> peter burden wrote: > and many more, most of them require parsing the HTML and performing > various analyses. I'd rather have an efficient polite crawler than a > fast crawler - I'd really like a crawler that was both of course ;-) Naturally. :) But there are so many elements in parsing a PPC or holding page or other junk that it does slow things down if done as part of the crawl. > Fetching and parsing both have to be done, they could be done on > separate machines but this does impose an overhead of extra > network traffic. I suppose it ultimately depends on how distribution > is actually organised. There hasn't been much talk on the list about this. The data still has to be transferred to a central site and processed to make it searchable. Not having the raw data can be a problem when it comes to fault finding. > In my (limited) experience parsing HTML is not particularly CPU > intensive - what is a pain is checking for duplicate pages, alias > host names, detecting various spider traps, deciding whether a > URL represents a page already fetched and maintaining > the various data structures. A lot of that can be handled at the pre-index stage. There are other tricks that allow pages to be compared. Parsing HTML is almost straightforward as it does not really vary from page to page. (Though people do tend to break it in interesting ways.) Looking for a series of specific strings in the HTML can slow things down.
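The easy wins mentioned in this thread (duplicate pages that share an MD5 hash, PPC or parked pages identifiable from a particular string) are the kind of thing that can be run over stored pages at the pre-index stage rather than inside the crawler. A minimal sketch, assuming the raw page bodies are already on disk or in memory; the marker strings and the `classify_pages` name are hypothetical, and a real marker list would come from junk pages actually observed.

```python
import hashlib

# Hypothetical marker strings for parked/PPC pages (illustration only).
PARKED_MARKERS = (b"this domain is for sale", b"domain parking")


def classify_pages(pages):
    """Split (url, raw_bytes) pairs into kept URLs, duplicates and junk.

    Duplicates are exact byte-for-byte copies, detected by MD5 digest;
    each is reported together with the first URL seen with that content.
    """
    seen = {}  # digest -> first URL with that content
    kept, duplicates, junk = [], [], []
    for url, body in pages:
        digest = hashlib.md5(body).hexdigest()
        if digest in seen:
            duplicates.append((url, seen[digest]))
            continue
        seen[digest] = url
        lowered = body.lower()
        if any(marker in lowered for marker in PARKED_MARKERS):
            junk.append(url)
        else:
            kept.append(url)
    return kept, duplicates, junk
```

An exact hash only catches identical pages; near-duplicates and anything needing a human decision are outside what this sketch covers.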
The more I look at the problem, the more I wonder if it might be better just to use some kind of wget-like program that respects robots.txt and to concentrate the parsing on the filesystem. The organisation of a distributed crawl is the key. Letting the spiders loose and hoping for the best is not the way to go about things. Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From jmcc at hackwatch.com Fri Aug 3 00:53:03 2007 From: jmcc at hackwatch.com (John McCormac) Date: Fri, 03 Aug 2007 01:53:03 +0100 Subject: [Search-l] Grub Update In-Reply-To: <46B276AA.7010607@wikia.com> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <46B11A4E.7080903@gmail.com> <5533B20B-A211-4621-ADED-11B615F2E905@jabber.org> <46B25305.7020601@hackwatch.com> <46B269C9.8070206@gmail.com> <46B276AA.7010607@wikia.com> Message-ID: <46B27C6F.9030204@hackwatch.com> Jimmy Wales wrote: > And this is the real potential strength of a distributed approach, I > think. With a small number of crawling machines, you perhaps have to > fetch fetch fetch fetch without a lot of "thinking". But with 10,000 or > 100,000 or 1,000,000 machines pitching in? If anything, it would require a lot more thinking at the initial stage as the handling process has to be well designed. This would be a largely asynchronous crawl with crawlers dropping in and out of the network at a far greater rate than would happen with dedicated crawlers. The crawl is not a one-off thing either. It is an ongoing process that involves a number of search indices. Some of them will be live and others will be development indices.
> I am not sure what the best architecture for that will end up being -- > that's an empirical question and I don't think we have enough experience > yet, any of us, to really know the answer. So, we move forward and > learn. :) Well there are a lot of possibilities. The trick is choosing the right one and turning it into reality. Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From bengeeknet at gmail.com Fri Aug 3 01:15:43 2007 From: bengeeknet at gmail.com (BenGeeK) Date: Thu, 02 Aug 2007 20:15:43 -0500 Subject: [Search-l] I'm new here Message-ID: <1186103743.13029.7.camel@bengeeknet> I'm new at this place and I just wanted to say hello, so... Hello everyone! My name is Benjamin Aquino, but you can call me BenGeeK. I hope I can help on this project. I'll start by reading what you are doing at this point and then ask how I can help, but today I just wanted to tell you who I am. P.S.: I'm sorry if my English is bad, but I'll improve it From george.mckeon at ghq.com Fri Aug 3 02:03:07 2007 From: george.mckeon at ghq.com (george.mckeon at ghq.com) Date: Fri, 03 Aug 2007 03:03:07 +0100 Subject: [Search-l] Micro search engine / semi-wiki ? Message-ID: <20070803030307.xhos2unmrok00c4w@82.195.128.132> John / Jim I have read your comments and emails on the wikia.com project with interest. "That is Wikia's strength - a large group of micro-search engines rather than one huge, generic, search engine like Google etc. The investment and profitability angle might be iffy." and more. I would like to contribute to the discussion but I do not know what I have developed in the current language of "wiki".
I have been doing the GHQ.com project for a few years and developed the current live site at GHQ.com http://www.ghq.com Basically it is intended to be a global search engines for weddings and celebrations where content is contributed on a semi wiki basis. Obviously our content will have to increase substantially. My intent was to create a search engine that is on specific real events and contributed on a semi-wiki basis. My question is: Does this qualify as a type of "micro-search engine" or a type of "wiki" ? Or would I merely be seen as someone trying for self promotion of their own project if I tried to contribute to the discussion ? Do I have anything to contribute to the wikia.com project in the GHQ.com project ? Many thanks for your consideration. Regards George ____________________________________________________ George Mc Keon Voice Int. 618 9317 1325 "GHQ.com - Browse Weddings - Add Congratulations" http://www.ghq.com ____________________________________________________ PS: Originally from Drogheda. Sorry about your non existent summer. Note: Weddings and real family live events was an obvious start point as they happen to most. It is intended to increase the scope over time, however restricting it to human events only. From martinolazz at cox.net Fri Aug 3 02:23:49 2007 From: martinolazz at cox.net (martino) Date: Thu, 2 Aug 2007 22:23:49 -0400 Subject: [Search-l] www.greenbbb.org Message-ID: <014f01c7d575$53db4f40$d7c66e44@martinofk34yz5> I'd like to add this domain to the index but can't get that to work. Can you add it for me? martino -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20070802/8bb7aec6/attachment.html From jmcc at hackwatch.com Fri Aug 3 05:04:26 2007 From: jmcc at hackwatch.com (John McCormac) Date: Fri, 03 Aug 2007 06:04:26 +0100 Subject: [Search-l] Micro search engine / semi-wiki ? 
In-Reply-To: <20070803030307.xhos2unmrok00c4w@82.195.128.132> References: <20070803030307.xhos2unmrok00c4w@82.195.128.132> Message-ID: <46B2B75A.70607@hackwatch.com> george.mckeon at ghq.com wrote: > John / Jim > > I have been doing the GHQ.com project for a few years and developed the > current live site at GHQ.com http://www.ghq.com > > Basically it is intended to be a global search engines for weddings and > celebrations where content is contributed on a semi wiki basis. > Obviously our content will have to increase substantially. It is a micro search engine with a clearly defined niche George, That makes it a lot easier to monetise than a generic search engine. The other important distinction is that it exists. > My intent was to create a search engine that is on specific real events > and contributed on a semi-wiki basis. > > My question is: Does this qualify as a type of "micro-search engine" or > a type of "wiki" ? It looks like a micro-search engine with a clearly identified market that is built with a wiki. It is a very good example of an ideal Wikia type search project. For this kind of project, the wiki aspect adds a layer of context to the data in a way that an ordinary search engine could not. Search engines like this have a far better chance of becoming authority sites in their niche markets than generic search engines like Google. They remove a lot of the noise from the search results and make it easier for the user. It also has an element of social networking and the way in which the events are grouped solves the geolocation aspect of search. (One of the big problems in the search business is in correlating websites with their locations in the real world.) > PS: Originally from Drogheda. > Sorry about your non existent summer. Things seem to be getting better. It didn't rain today - yet. (Though it is only 0604 in the morning.) 
:) Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From andreengels at gmail.com Fri Aug 3 09:53:16 2007 From: andreengels at gmail.com (Andre Engels) Date: Fri, 3 Aug 2007 11:53:16 +0200 Subject: [Search-l] Micro search engine / semi-wiki ? In-Reply-To: <6faf39c90708030252y22fde617q2335a7df9e3cb927@mail.gmail.com> References: <20070803030307.xhos2unmrok00c4w@82.195.128.132> <6faf39c90708030252y22fde617q2335a7df9e3cb927@mail.gmail.com> Message-ID: <6faf39c90708030253i2b7018b9h7e93635e5b4591c@mail.gmail.com> 2007/8/3, george.mckeon at ghq.com : > I would like to contribute to the discussion but I do not know what I > have developed in the current language of "wiki". > > I have been doing the GHQ.com project for a few years and developed > the current live site at GHQ.com http://www.ghq.com > > Basically it is intended to be a global search engines for weddings > and celebrations where content is contributed on a semi wiki basis. > Obviously our content will have to increase substantially. > > My intent was to create a search engine that is on specific real > events and contributed on a semi-wiki basis. > > My question is: Does this qualify as a type of "micro-search engine" > or a type of "wiki" ? > > Or would I merely be seen as someone trying for self promotion of > their own project if I tried to contribute to the discussion ? > > Do I have anything to contribute to the wikia.com project in the > GHQ.com project ? To me this is closer to a wiki than to a search engine as I think we are discussing here. The big difference is that your search engine is only for searching on your own site, it is a local search engine on a specific database. 
It is closer to a wiki - enabling people to put their own material on your site to share it with the world. However, it differs from a wiki in that people cannot later change other people's pages, so I guess it would be better to put it under the more general umbrella of 'Web 2.0'. To get into the micro-search engine as I see them (not sure about John), you would have to have to enable people to search for your specific type of content if it is elsewhere on the web, not just your own site. Which would give the new problems of finding the content and selecting whether it is applicable - which are the places where I see cooperation and help from volunteers come in. -- Andre Engels, andreengels at gmail.com ICQ: 6260644 -- Skype: a_engels From oss.net at cox.net Fri Aug 3 11:52:37 2007 From: oss.net at cox.net (Robert Steele, CEO EIN/OSS) Date: Fri, 3 Aug 2007 07:52:37 -0400 Subject: [Search-l] Grug, CISCO AON, and Amazon Message-ID: <006a01c7d5c4$c9d2f320$65410a0a@OSSTVL> Herewith for the information of the group, the text of a 1 August letter to CEO CISCO. Mr. John Chambers Chief Executive Officer Cisco Systems, Inc. 170 West Tasman Dr. San Jose, CA 95134 USA Dear Mr. Chambers, It has been my privilege to be a pioneer of sorts, exploring for the past twenty years the reality that most capitals are operating on two percent of the relevant information, and the emerging possibilities of what happens when every person can have access to all information in all languages all the time. I have the deepest admiration for CISCO AON and for Bill Ruh, who was the person responsible for creating the Marine Corps Intelligence Center (today a Command) systems, when I was the senior civilian and deputy director. CISCO AON is inspiring, and the purpose of this letter to articulate for your consideration an idea you may already be pursuing, but which I wish to accelerate if that is the case. 
In my view, and I have deep information and a direct source for you if you wish, Google is predatory and their data centers will collapse the minute you introduce recyclable individual routers that give CISCO AON capabilities to every single individual, while also offering 10 GB of elective visible storage, and a secure means of sharing excess or available CPUs with what I think of as the World Brain. My lecture to Amazon, standing room only with VPs and PMs coming in-I drew twice as many as the Microsoft and IBM speakers, almost 300-is enclosed in DVD form for your convenience. I also enclose a copy of my planned Gnomedex presentation on 10 August 2007 in Seattle. Finally, I enclose for your private information a one-page memorandum invited from me by the Director of National Intelligence (DNI). I am on the verge of winning a twenty-year battle to keep open source information and multinational information sharing "outside the wire" as a civil affairs function. As a former spy, I know the pathologies of the secret world. If you were to buy Grub, the distributed search service, and also www.telelanguage.com, and perhaps, in partnership with Sun Microsystems (a Google purchase target that should be denied them), also www.silobreaker.com, then the way will be open to offering the world both privacy and security and infinite sharing with embedded all language translation by humans at very low cost, and a suite of Open Analytic and Collaboration Tools. Add all Amazon content available for micro-cash, recruit all authors and artists directly into Amazon, and we have the World Brain. Best wishes, Robert D. Steele (Vivas) Chief Executive Officer -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.wikia.com/pipermail/search-l/attachments/20070803/d40c4a2b/attachment.html From mjesales at gmail.com Fri Aug 3 15:20:24 2007 From: mjesales at gmail.com (MJE Sales, LLC) Date: Fri, 3 Aug 2007 11:20:24 -0400 Subject: [Search-l] Privacy and sharing browsing data (Seth Ford) Message-ID: <4de9d4950708030820p9bf629cs9235c28909bbfa4@mail.gmail.com> This is my first reply to something so If I screwed it up - I'm sorry. I like the idea of having something run in the browser that would know what urls to spider based on our browsing history. Lots of people use the google toolbar, alexa tool bar or compete toolbars - all of which send to the server what website or websites you are at and everything else. A little firefox button that simply logged the urls - or the domains and sent it anonymously to the server - would be a great way of developing an index that had sites that you knew people were actually visiting. If you are looking at hindering spam - there has to be some sort of AI component or a human element. Why not create a stumbleupon type thing where sites are flagged as spam or not spam. to reduce the load on all the servers it could send 10, 25, or 50 urls at a time. But then again I guess each persons definition of spam is a little different. We run several large domains with 100,000's of pages, so our approach is a tad different. Have a Great Day! Life is what you make of it! Matt Ellsworth MJE Sales, LLC 702-953-5733 Skype: mjesales yahoo: mattseo http://www.mjesales.com http://www.articlesnatch.com "The richest people in the world look for and build networks, everyone else looks for work." ~ Robert Kiyosaki RE: Date: Mon, 30 Jul 2007 13:35:19 -0600 From: "Seth Ford" Subject: Re: [Search-l] Privacy and sharing browsing data To: "Jimmy Wales" Cc: search-l at wikia.com Message-ID: Content-Type: text/plain; charset="iso-8859-1" Thats why I think it has to be a mash-up. 
You have to allow people to look to the community first and then look to the crawl, be it tab based or inline. It seems people are more interested in participating once they trust they can find the data they are looking for and are then given encouragement to participate and organize it in a more reasonable fashion. I have sent out some of the implementation I have done along these lines internally where I work. It does seem like it comes down to a matter of trust; internally it's much easier to do a community powered search engine built of a wiki mashed by a crawl. Externally how do you hinder spam and gaming and foster the sense of identity? Maybe it's a /. like implementation or simply wikipedia is as good as it gets...? Seth On 7/28/07, Jimmy Wales wrote: > > (This was about faroo.com ) > > jer wrote: > > Yeah, noticed them too, completely not open source... > > Yup! But doing this: > > >> "When an user opens a page with the browser, it will be automatically > >> inserted into the distributed index of the p2p network. The > >> additional network load and the site submission of a traditional > >> crawler is omitted. Assuming a wide spread of FAROO this enables an > >> almost complete index, updated in real time." > > Seems pretty easy to do with a simple firefox extension. > > The difficult bit is thinking about user privacy and stopping spam. Let > me explain what I mean: > > When we have a public way for people to submit, tag, and rate urls there > are no particularly difficult issues with privacy because when you submit > something, you are doing it publicly and if you want privacy, you'd best > use a pseudonym to login... just like with any wiki. Anyone who is > inserting junk into the index will be quickly detected and blocked or > rated as a spammer, and there you go. > > But simply browsing the web is a different matter. I would not be happy > with having my click stream of what I am surfing made public -- even if > I was using a pseudonym.
There are simply too many ways to guess who I > am from my click stream. > > And yet, if no one can see my click stream, then I might just be a > spammer merrily trolling around on my own spamtastic crap site. > > I think there are some clever solutions to this possible. One would be > that my browsing history would never be made public BUT if urls that I > have submitted made it into the index, and people subsequently mark them > as spam, then this fact shows up publicly in the form of a number: "This > user has submitted X urls which were subsequently judged by the > community to be spam." This could be said without revealing what they > were. > > That is just the first thought of how to go about it. > > I am eager to think about way that we can encourage passive > participation by GOOD people who simply believe in our mission, would > like to give us good data on real browsing patterns, but who rightly > value their privacy, while at the same time preventing spammers from > wasting too much of our time. > > --Jimbo From jeremie at jabber.org Fri Aug 3 21:38:27 2007 From: jeremie at jabber.org (jer) Date: Fri, 3 Aug 2007 16:38:27 -0500 Subject: [Search-l] Privacy and sharing browsing data (Seth Ford) In-Reply-To: <4de9d4950708030820p9bf629cs9235c28909bbfa4@mail.gmail.com> References: <4de9d4950708030820p9bf629cs9235c28909bbfa4@mail.gmail.com> Message-ID: <454AECC4-0789-4D89-9AFD-F8D5F164458D@jabber.org> Definitely check out the Attention Trust, these folks are on the right track and I hope that we can become one of the opt-in Attention Services who's goal is to source/weight items for a crawler (with a public crawl output). http://www.attentiontrust.org/ Jer
From mjesales at gmail.com Sat Aug 4 00:18:09 2007 From: mjesales at gmail.com (MJE Sales, LLC) Date: Fri, 3 Aug 2007 20:18:09 -0400 Subject: [Search-l] Privacy and sharing browsing data (Seth Ford) In-Reply-To: <454AECC4-0789-4D89-9AFD-F8D5F164458D@jabber.org> References: <4de9d4950708030820p9bf629cs9235c28909bbfa4@mail.gmail.com> <454AECC4-0789-4D89-9AFD-F8D5F164458D@jabber.org> Message-ID: <4de9d4950708031718j40712b3cid6fcd6747c4fc853@mail.gmail.com> Only problem with that it seems is that the services they recomend there - none of them still exist - at least I can't find them... The last post on this site was in March 2007. Is this thing even still around? Matt Ellsworth MJE Sales, LLC
From jmcc at hackwatch.com Sun Aug 5 20:39:43 2007 From: jmcc at hackwatch.com (John McCormac) Date: Sun, 05 Aug 2007 21:39:43 +0100 Subject: [Search-l] Privacy and sharing browsing data (Seth Ford) In-Reply-To: <4de9d4950708030820p9bf629cs9235c28909bbfa4@mail.gmail.com> References: <4de9d4950708030820p9bf629cs9235c28909bbfa4@mail.gmail.com> Message-ID: <46B6358F.7090407@hackwatch.com> MJE Sales, LLC wrote: > If you are looking at hindering spam - there has to be some sort of AI > component or a human element. Why not create a stumbleupon type thing > where sites are flagged as spam or not spam. to reduce the load on > all the servers it could send 10, 25, or 50 urls at a time. But then > again I guess each persons definition of spam is a little different. > We run several large domains with 100,000's of pages, so our approach > is a tad different. Would this solution scale well? The human element has almost always been the weak link in fighting search engine spam. The index quality part of running a search engine is always the toughest part of the task. Even allocating 50 or so urls at a time would require, at a rough guess, hundreds of thousands of active users. It would effectively require some form of Wikia toolbar where usage and voting data would be fed back to the central site. This adds yet another layer of complexity on an as yet non-existent layercake search project. And then there is the reluctance of users to install yet another piece of cruft into their browser after the Google toolbar, the Microsoft toolbar, the Yahoo toolbar etc.
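A rough way to put numbers on the scaling question above. This is only a back-of-envelope sketch: the batch size of 50 comes from the proposal being discussed, but the index size, the number of independent votes per URL, the batches a volunteer handles per day, and the review cycle are all assumed values, not figures from this thread. The function name is hypothetical.

```python
import math


def reviewers_needed(total_urls, votes_per_url, batch_size,
                     batches_per_user_per_day, days):
    """Estimate active users needed to vote on every URL once per review cycle."""
    total_votes = total_urls * votes_per_url
    votes_per_user = batch_size * batches_per_user_per_day * days
    return math.ceil(total_votes / votes_per_user)


# Assumed inputs (hypothetical): 100M URLs, 3 votes per URL, batches of 50,
# one batch per user per day, and the whole index re-reviewed every 30 days.
print(reviewers_needed(100_000_000, 3, 50, 1, 30))  # prints 200000
```

Under those assumptions the estimate lands in the "hundreds of thousands" range mentioned above; a smaller index or a slower review cycle would bring it down proportionally.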
Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From jmcc at hackwatch.com Sun Aug 5 21:53:06 2007 From: jmcc at hackwatch.com (John McCormac) Date: Sun, 05 Aug 2007 22:53:06 +0100 Subject: [Search-l] What Is Wikia and How Real Is It? Message-ID: <46B646C2.4040800@hackwatch.com> (I've just spent the last day or so checking approximately 2.5M .uk domains and websites so apologies in advance if this appears as a bit of a rant.) What exactly is Wikia? Is it a search engine or just a form of hybrid super directory based on Wikipedia and similar sites? So far there has been nothing real about this project beyond what is effectively just another wiki about the idea of a search engine. Talking about search engine development is all well and good but there has to be real development taking place. Otherwise this will just end up as another vapourware search engine like Dipsie and so many others. What worries me about this whole project is that the people at the top, despite having a lot of other experience, have no recognisable search engine development experience (from what I've seen). Having the search engine software is only a small part of the project. It is like having a jet without a pilot or fuel. Though with Grub, it is not so much a jet aircraft as a bicycle with the handlebars and wheels missing. Perhaps Wikia has a few people who have been hired who actually have done more than just read about search engine development. If Wikia is to be a Google Killer then it has to have such real operators rather than toy soldiers. Google, Microsoft and Yahoo all take search deadly seriously and have a lot of research going on. They will not concede market share gracefully.
This is why I think that talk of Wikia being a Google Killer is marketing hype when compared to the reality of these search engines. Looksmart (the company that bought Grub and sold (?) it again to Wikia) got slaughtered by these players. I'd like to see a lot more diversity in the search market. Hopefully Wikia has a chance of providing some. How many websites does Wikia expect to spider? Does it have a clear idea of how many websites are out there on the net? How will it deal with the languages issue or is it going to be a universally English language operation? This talk of Open Source and opening up search is all very nice. The whole opening up of search engine algorithms is quite ridiculous because once they are open, it becomes easy to game the search engine. (This would be an argument for the hybrid quality controlling effect of a Wikipedia-type approach.) Google found this out with its PageRank and other search engines had similar problems. The gaming of the algorithms is one of the major problems that the large search engines spend a lot of time trying to solve. They continually tweak their algorithms to make them more effective and less prone to gaming. Making the data available is also a bit of a red herring. Just how much data do the people in Wikia think would result from a full crawl of the web? Or just taking some of the minor gTLDs - how much data would result from a crawl of .info or .biz? How would webmasters react to their data being made freely available like this? Every spam scraper and MFA (Made For AdSense) plagiarist would have a field day with this data. But the sheer size of the data resulting from a full or even a limited crawl makes the whole downloadable aspect highly questionable. Google and the other major players have thousands of servers. Unless Wikia is to be some P2P type engine, it will take a lot more investment to make Wikia a viable threat to the main players. Who benefits? What makes Wikia different?
How close to realisation is it? Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From jeremie at jabber.org Mon Aug 6 03:18:52 2007 From: jeremie at jabber.org (jer) Date: Sun, 5 Aug 2007 22:18:52 -0500 Subject: [Search-l] What Is Wikia and How Real Is It? In-Reply-To: <46B646C2.4040800@hackwatch.com> References: <46B646C2.4040800@hackwatch.com> Message-ID: John, I can understand your pessimism when looking at what we're doing as trying to be a "Google Killer" but we're on different wavelengths. We are not building yet-another-search-engine, we are putting our efforts into making building ANY search engine easier, better tools, better methods, more shared systems, etc. This isn't one project, it's tens or even hundreds of them, and likely to take years. Clearly you're neck-deep in search development yourself, so you would have a great opinion on what kinds of tools and resources would make your life easier, do you have any suggestions? Jer
From jeremie at jabber.org Mon Aug 6 12:52:18 2007 From: jeremie at jabber.org (jer) Date: Mon, 6 Aug 2007 07:52:18 -0500 Subject: [Search-l] Text Categorization Project (thanks Intellisophic!) Message-ID: <8354339D-C233-4BAB-9AF0-F3D28E50ED46@jabber.org> The folks over at Intellisophic (http://www.intellisophic.com/) have taken a big step and open-sourced their core text-categorization engine.
Some PR went out about it already (http://www.prweb.com/releases/2007/8/prweb544628.htm) and this week they'll be introducing us to the codebase both here on the list and on the wiki and swlabs.org dev servers. This new project around their engine is just one component to search, and very useful outside of search as well. For those that aren't familiar with text categorization in general, this is a good starting point: http://en.wikipedia.org/wiki/Document_classification As with everything else, it will take some time and nurturing to really see the fruits of this project, but it's an important piece to the puzzle and I thank Intellisophic for taking this big step and contributing to the larger open vision of the future. Jer From jmcc at hackwatch.com Mon Aug 6 13:02:14 2007 From: jmcc at hackwatch.com (John McCormac) Date: Mon, 06 Aug 2007 14:02:14 +0100 Subject: [Search-l] What Is Wikia and How Real Is It? In-Reply-To: References: <46B646C2.4040800@hackwatch.com> Message-ID: <46B71BD6.3080101@hackwatch.com> jer wrote: > John, I can understand your pessimism when looking at what we're doing > as trying to be a "Google Killer" but we're on different wavelengths. Well some people seem to think that I am on a different planet entirely. :) The venture is being portrayed as a Google Killer in the media coverage and spin. The problem is that there is no actual basis for such a claim other than it gives the media a nice soundbite and keeps the investors happy. > We are not building yet-another-search-engine, we are putting our > efforts into making building ANY search engine easier, better tools, > better methods, more shared systems, etc. This isn't one project, it's > tens or even hundreds of them, and likely to take years. So if I read this right, there is no search engine? It is just an idea for a platform that is scalable and can be used for search engine development?
But without knowing the processing requirements, the storage requirements and the bandwidth requirements, it is difficult to design such a platform. > Clearly you're neck-deep in search development yourself, so you would > have a great opinion on what kinds of tools and resources would make > your life easier, do you have any suggestions? Ideally, the best resource would be more time. By comparison everything else pales. The holy trinity of search is bandwidth, hardware and software. The bandwidth required to spider tens of millions of websites on an ongoing basis is considerable. Therefore such a venture would need a lot of available bandwidth. The hardware is also a very significant requirement. It would need a lot of servers to do a proper crawl of the web. It would also require a backend to process the resulting data into something usable. And a search interface would be required. The software aspect is perhaps somewhat easier as the task can be clearly defined. It has to be scalable, fast and provide good results. However that is a massive simplification. There are some good Open Source products out there that do the job well. Nutch is one of the most popular products in this respect. It also has the elements of scalability required for large indices. And the tools to work on the resultant data are well developed and supported. Most of the work will be on the resulting data. The search index is the hard part. It takes a long time to develop a good, clean index. The Infinite Monkeys approach to building an index (following links and hoping that they will lead to new pages) is not the most efficient method of building an index quickly when any of the prior requirements are absent or deficient. A good index makes the difference between a great search engine and a spam infested pile of junk. I'm not convinced that the Wikia people quite appreciate the level of work that goes into that aspect of developing a search engine. 
Crawling a clearly defined index such as that of Wikipedia or some other silo site is easy. However crawling the web is like trying to take a slice of a swirling nebula. It isn't really a question of what we want. It is more a question of what the Wikia project can provide to make the task of developing a search engine easier. Developing a viable search index is the hardest task of all - the other elements (the hardware, the bandwidth and the software) can be acquired to some extent. So what exactly can Wikia offer? Bandwidth? Hardware? Expertise? Can you give us some descriptions and specifications of the resources and expertise that are available to search engine developers? Most of us have to deal with the realities imposed by hardware and bandwidth limitations. We don't have the luxury of just theorising - everything we do is geared towards survival in a highly competitive market. Perhaps we SE people really are on a different wavelength to the Wikia people. Perhaps the question foremost in the minds of many of the SE people on this list is this: why should we provide the search expertise? Or, to put it less diplomatically, why should we make you rich?
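The bandwidth and storage requirements raised earlier in this message can be roughed out in a few lines. This is a hedged sketch, not a sizing method anyone on the list proposed: the function names are hypothetical, and the site count, pages per site, average page size and recrawl interval are assumed placeholder values chosen only to show the arithmetic.

```python
def crawl_storage_tb(sites, pages_per_site, avg_page_kb):
    """Raw (uncompressed) size of one full crawl, in terabytes (10**12 bytes)."""
    total_bytes = sites * pages_per_site * avg_page_kb * 1024
    return total_bytes / 1e12


def crawl_bandwidth_mbps(sites, pages_per_site, avg_page_kb, recrawl_days):
    """Average sustained bandwidth (megabits per second) needed to
    refetch every page once per recrawl cycle."""
    total_bits = sites * pages_per_site * avg_page_kb * 1024 * 8
    seconds = recrawl_days * 24 * 60 * 60
    return total_bits / seconds / 1_000_000


# Assumed inputs (hypothetical): 20M sites, 50 pages each, 25 KB per page,
# every page refetched once every 30 days.
print(round(crawl_storage_tb(20_000_000, 50, 25), 1), "TB per crawl")
print(round(crawl_bandwidth_mbps(20_000_000, 50, 25, 30), 1), "Mbit/s sustained")
```

The figures are averages over the whole cycle; peak demand, retries, DNS traffic and index storage would all come on top of this.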
Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From jeremie at jabber.org Mon Aug 6 13:21:29 2007 From: jeremie at jabber.org (jer) Date: Mon, 6 Aug 2007 08:21:29 -0500 Subject: [Search-l] Grub Update In-Reply-To: <46B24A0D.3000606@hackwatch.com> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <4B78A6C2-A0C5-4614-B76B-CFE3E5A8B0DF@jabber.org> <46B24A0D.3000606@hackwatch.com> Message-ID: <57078C94-CE4D-4D0A-AEBC-BF7B4232898E@jabber.org> > Yes Jer, but you don't know what I've done and vice versa. So I > don't know if we have been through quite the same process. :) In 05 & 06 I built a web search as part of a private R&D project (various iterations had between 100M and 1B pages), it's when I really got upset about the state of the whole web search industry and realized that building an open foundation right now will make a tremendous impact in the next 5-10 years. > The thing about search on this or any other significant scale is > that it requires a completely different mindset to that required > for building a web directory or wiki where each entry can be > individually validated. I don't think that adding a 'crap-ton' of > human intelligence to the process is an accurate description of > what happens. > > The indications that mark a site for deletion tend to be clear and > it is the speed on which this happens that is important. Sometimes, > this has to be applied to every website on an IP or even on the > same DNS. It is a very anti-democratic process. Some are easy wins > - linkswamps that can be identified by a DNS or IP. 
PPC that can be > identified from a particular string, duplicate content pages that > all have the same MD5 hash etc. The hard part is when it goes > beyond the easy wins to the stuff that requires a human decision. That's why it's an open source and human platform: each does their own part as best they can; it's not all one or all the other. > Some of us (those lucky enough to survive in the search engine > wars) have been doing this kind of work independently for years. > We do talk to each other but there is a slight attitude of "better > him than me" when some other search engine venture goes dot.bomb. > Some of the techniques and methodology of search engine development > are closely held - none more closely than a good search index. The > tools for building search engines are widely available (Nutch etc). > It is the human element of the equation that is in short supply. > > Many on the second and third tiers (those below GYM (Google/Yahoo/ > Microsoft)) of the search business have been talking on internet > fora and lists for years. Having spent years developing a good > search index, many of these people would not particularly want to > give up such an edge. Though the wiki idea is nice, the mindset is > somewhat different to that of Wikipedia and the whole "Cathedral > and the Bazaar" model. Most search engine developers are too busy > trying to survive without having to subscribe to some happy-clappy > ethos that could very well put them out of business. These are the > guys who you will have to convince that there is some value to > being involved in the Wikia search project. I don't need to convince anyone. I'll build something I believe in and build/share it openly, and if others share in the vision and passion then they are welcome to participate. > That's all very laudable but this is a business. 
The small search > engines are not going to hand over their survival edge to Jimmy's > vision, which is essentially that of a competitor who will take > their work and monetise it. That is the road block that the project > has to get beyond. It's an inefficient and closed business, and that will change. Survival may depend on participating and collaborating much more openly than it's done today in everyone's search silo. > But without that essential spark of the search engine developers, > there is a danger that the project could just be another platform - > much like Amazon's search and servers product. Being a search > engine developer is not the same as being a web developer. There is > a lot more thinking and learning involved. Most thinking is about > the "searching for what" question. It defines the nature of the > search engine being developed. It makes the search engine a macro > search engine or a niche engine. It makes the difference between > success and failure. > > Having a platform for open search is nice. It might attract some > search engine developers. Having a real search idea to go with that > platform is better. Is Wikia search just an open platform without > an idea for a search application? I think I answered this in the other email: we're here because we want to move *all* search forward using open and social value systems. There should be many sparks, many applications, that are much easier and faster to build atop an open platform with lots of free tools and resources. 
Jer From jeremie at jabber.org Mon Aug 6 13:23:06 2007 From: jeremie at jabber.org (jer) Date: Mon, 6 Aug 2007 08:23:06 -0500 Subject: [Search-l] Grub Update In-Reply-To: <46B27C6F.9030204@hackwatch.com> References: <46AFCBCD.8010502@hackwatch.com> <46AFE229.1020409@wikia.com> <46AFF132.20400@hackwatch.com> <46B11A4E.7080903@gmail.com> <5533B20B-A211-4621-ADED-11B615F2E905@jabber.org> <46B25305.7020601@hackwatch.com> <46B269C9.8070206@gmail.com> <46B276AA.7010607@wikia.com> <46B27C6F.9030204@hackwatch.com> Message-ID: <377640E2-9E0F-4DA0-BCA5-AAF28411A7E8@jabber.org> > Well there are a lot of possibilities. The trick is choosing the right > one and turning it into reality. The real trick is choosing one, and then another, and another, until you find something that works. Jer From jwales at wikia.com Mon Aug 6 13:25:40 2007 From: jwales at wikia.com (Jimmy Wales) Date: Mon, 06 Aug 2007 06:25:40 -0700 Subject: [Search-l] What Is Wikia and How Real Is It? In-Reply-To: <46B71BD6.3080101@hackwatch.com> References: <46B646C2.4040800@hackwatch.com> <46B71BD6.3080101@hackwatch.com> Message-ID: <46B72154.3080107@wikia.com> John McCormac wrote: > The venture is being portrayed as a Google Killer in the media coverage > and spin. The problem is that there is no actual basis for such a claim > other than it gives the media a nice soundbite and keeps the investors > happy. Actually, I think it makes the investors wonder what kind of lunatic I am. :) We are trying to downplay the "google killer" story line, but it is a great story line, and so the media runs with it anyway. You would get the same story lines about RedHat and Microsoft a few years ago. It's an interesting story, but has little relationship to getting some work done. > So if I read this right, there is no search engine? There currently is no search engine. 
This is a project to build one, but more importantly, to build this: > It is just an idea for a platform that is scalable and can be used for > search engine development? But without knowing the processing > requirements, the storage requirements and the bandwidth requirements, > it is difficult to design such a platform. Figuring out those things is part of the process, yes? > The bandwidth required to spider tens of millions of websites on an > ongoing basis is considerable. Therefore such a venture would need a lot > of available bandwidth. > > The hardware is also a very significant requirement. It would need a lot > of servers to do a proper crawl of the web. It would also require a > backend to process the resulting data into something usable. And a > search interface would be required. Yes, so that matches my own very scientific estimates. "a lot of bandwidth" and a "lot of servers". :) > The search index is the hard part. It takes a long time to develop a > good, clean index. The Infinite Monkeys approach to building an index > (following links and hoping that they will lead to new pages) is not the > most efficient method of building an index quickly when any of the prior > requirements are absent or deficient. I absolutely agree with that. I don't think anyone is proposing an Infinite Monkey approach to spidering. > A good index makes the difference between a great search engine and a > spam infested pile of junk. I'm not convinced that the Wikia people > quite appreciate the level of work that goes into that aspect of > developing a search engine. Crawling a clearly defined index such as > that of Wikipedia or some other silo site is easy. However crawling the > web is like trying to take a slice of a swirling nebula. Would it help if I say that I *do* appreciate the level of work that goes into that aspect of things? Not sure what you are looking for here. The task at the moment for me is to design the social aspect of the community part of the site. 
The goal is to have good tools to allow the community to control the crawl in intelligent ways. This is not Infinite Monkeys, and it has to deal with interesting questions about self-interested editors, trust, etc. > So what exactly can Wikia offer? Bandwidth? Hardware? Expertise? Can you > give us some descriptions and specifications of the resources and > expertise that is available to search engine developers? For most of us, > we have to deal with the realities imposed by hardware and bandwidth > limitations. We don't have the luxury of just theorising - everything we > do is geared towards survival in a highly competitive market. Perhaps we > SE people really are on a different wavelength to the Wikia people. Well, we do have the luxury of being able to provide hardware and bandwidth to the community. So we don't have to cut corners in those areas. > Perhaps the question foremost in the minds of many of the SE people on > this list is this: why should we provide the search expertise? Or, to > put it less diplomatically, why should we make you rich? I am not asking you to make me rich. If you don't want to participate, then don't. If you think you can go out on your own and build a proprietary search engine that makes you money, go ahead. If you think that you could find it useful to work with a broader community to leverage each other's talents so that in whatever you are doing (enterprise search? niche search on the web? social search?), there is a chance for you to compete with the big players on a much more level playing field, then come and help us. If you want us to build it for you, for free, giving it all to you and asking for nothing in return, then... well, that's fine too. :) That's what we do. 
--Jimbo From seth.ford at gmail.com Mon Aug 6 14:42:06 2007 From: seth.ford at gmail.com (Seth Ford) Date: Mon, 6 Aug 2007 08:42:06 -0600 Subject: [Search-l] 10 People Powered Search Engines Message-ID: Saw these articles http://mashable.com/2007/06/01/10-people-powered-search-engines/ and http://sfgate.com/cgi-bin/article.cgi?f=/c/a/2006/09/04/BUGVCKTPOP1.DTL I completely agree with Rob Enderle... "You can't build a better Google. You have to approach this market differently". Personally I love the way http://mahalo.com/ is tackling the problem... they just need a bit more work. But the mashup for people powered wiki content and Google search for the tail that doesn't exist is the right one. From jwales at wikia.com Mon Aug 6 14:49:12 2007 From: jwales at wikia.com (Jimmy Wales) Date: Mon, 06 Aug 2007 07:49:12 -0700 Subject: [Search-l] 10 People Powered Search Engines In-Reply-To: References: Message-ID: <46B734E8.8040403@wikia.com> Seth Ford wrote: > Saw these articles > http://mashable.com/2007/06/01/10-people-powered-search-engines/ and > http://sfgate.com/cgi-bin/article.cgi?f=/c/a/2006/09/04/BUGVCKTPOP1.DTL > I completely agree with Rob Enderle... "You can't build a better Google. > You have to approach this market differently". Personally I love the way > http://mahalo.com/ is tackling the problem... they just need a bit more > work. But the mashup for people powered wiki content and Google search > for the tail that doesn't exist is the right one. I agree with the general concept of "let humans do what humans do well, let computers do what computers do well" and I also think this roughly translates to "the head and the long tail"... But what Mahalo is doing is totally uninteresting to me because it is proprietary. It doesn't change the structure of the industry. 
From jmcc at hackwatch.com Mon Aug 6 14:51:35 2007 From: jmcc at hackwatch.com (John McCormac) Date: Mon, 06 Aug 2007 15:51:35 +0100 Subject: [Search-l] What Is Wikia and How Real Is It? In-Reply-To: <46B72154.3080107@wikia.com> References: <46B646C2.4040800@hackwatch.com> <46B71BD6.3080101@hackwatch.com> <46B72154.3080107@wikia.com> Message-ID: <46B73577.5020500@hackwatch.com> Jimmy Wales wrote: > You would get the same story lines about RedHat and Microsoft a few > years ago. It's an interesting story, but has little relationship to > getting some work done. Yes but Redhat had an active developer community and Microsoft had a near monopoly on the desktop market. Redhat has carved out a niche for itself and Microsoft is trying to diversify. > The task at the moment for me is to design the social aspect of the > community part of the site. The goal is to have good tools to allow the > community to control the crawl in intelligent ways. This is not > Infinite Monkeys, and it has to deal with interesting questions about > self-interested editors, trust, etc. So the social element will be developed first? That's probably the strongest aspect of Wikia since it may result in a better index. > Well, we do have the luxury of being able to provide hardware and > bandwidth to the community. So we don't have to cut corners in those > areas. Yes but developers need specifications. Can you at least give us an indication as to the servers and bandwidth available? > If you think you can go out on your own and build a proprietary search > engine that makes you money, go ahead. If you think that you could find SE developers (those outside the big players) tend to think like that. Where they cannot compete with the big players at their level, they will identify niches and areas where the big players are weak and work accordingly. > it useful to work with a broader community to leverage each others > talents so that in whatever you are doing (enterprise search? 
niche > search on the web? social search?), there is a chance for you to compete > with the big players on a much more level playing field, then come and > help us. Working with a broader community would be nice. But what you have to understand is that we have to make a living. Therefore we might take the search business a bit more seriously when it comes to planning and implementation. Those of us who have survived in the business this long will have seen others, both friends and competitors, fall by the wayside and tend to be cynical when grandiose Google Killer claims are made. So you will have to forgive us if we ask so many questions. :) Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From jwales at wikia.com Mon Aug 6 14:51:17 2007 From: jwales at wikia.com (Jimmy Wales) Date: Mon, 06 Aug 2007 07:51:17 -0700 Subject: [Search-l] What Is Wikia and How Real Is It? In-Reply-To: <46B73577.5020500@hackwatch.com> References: <46B646C2.4040800@hackwatch.com> <46B71BD6.3080101@hackwatch.com> <46B72154.3080107@wikia.com> <46B73577.5020500@hackwatch.com> Message-ID: <46B73565.2020802@wikia.com> John McCormac wrote: > So the social element will be developed first? That's probably the > strongest aspect of Wikia since it may result in a better index. We'll do as much as we can as fast as we can. :) > SE developers (those outside the big players) tend to think like that. > Where they cannot compete with the big players at their level, they will > identify niches and areas where the big players are weak and work > accordingly. 
That's totally sensible, and I also think that such developers will be excited about and interested in strengthening the tools available to them for that type of work. > Working with a broader community would be nice. But what you have to > understand is that we have to make a living. Therefore we might take the > search business a bit more seriously when it comes to planning and > implementation. Those of us who have survived in the business this long > will have seen others, both friends and competitors, fall by the wayside > and tend to be cynical when grandiose Google Killer claims are made. So > you will have to forgive us if we ask so many questions. :) I love questions. :) And I love people making a living, nothing wrong with that. --Jimbo From andreengels at gmail.com Mon Aug 6 15:04:07 2007 From: andreengels at gmail.com (Andre Engels) Date: Mon, 6 Aug 2007 17:04:07 +0200 Subject: [Search-l] 10 People Powered Search Engines In-Reply-To: <6faf39c90708060803g33f91d3s77c5ece3a93fe832@mail.gmail.com> References: <46B734E8.8040403@wikia.com> <6faf39c90708060803g33f91d3s77c5ece3a93fe832@mail.gmail.com> Message-ID: <6faf39c90708060804l31d55f61udf83c356efb7bfbc@mail.gmail.com> 2007/8/6, Jimmy Wales : > Seth Ford wrote: > > Saw these articles > > http://mashable.com/2007/06/01/10-people-powered-search-engines/ and > > http://sfgate.com/cgi-bin/article.cgi?f=/c/a/2006/09/04/BUGVCKTPOP1.DTL > > I completely agree with Rob Enderle... "You can't build a better Google. > > You have to approach this market differently". Personally I love the way > > http://mahalo.com/ is tackling the problem... they just need a bit more > > work. But the mashup for people powered wiki content and Google search > > for the tail that doesn't exist is the right one. > > I agree with the general concept of "let humans do what humans do well, > let computers do what computers do well" and I also think this roughly > translates to "the head and the long tail"... 
> > But what Mahalo is doing is totally uninteresting to me because it is > proprietary. It doesn't change the structure of the industry. It might be uninteresting in the sense of "someone to work with", but it still looks good as an idea of how it could work. I myself would like to integrate the search engine and the pages more - like showing, alongside the page itself, the top search results that are not yet on the page. That would help users in finding new material to add to the page, and it could also improve the search engine if they could give judgements with reasons to the results (like "bring this result down because it is about another subject/is just advertisement/contains less information than already is on the page/is about a narrow sub-subject only...") -- Andre Engels, andreengels at gmail.com ICQ: 6260644 -- Skype: a_engels From seth.ford at gmail.com Mon Aug 6 15:18:19 2007 From: seth.ford at gmail.com (Seth Ford) Date: Mon, 6 Aug 2007 09:18:19 -0600 Subject: [Search-l] Fwd: 10 People Powered Search Engines In-Reply-To: References: <46B734E8.8040403@wikia.com> Message-ID: But that is easy enough to change... First start with MediaWiki as the base (open it up for everyone to edit, but pay for content editors/guides), as they did, but have yet to open it up. Add in a better internal search product that can index the MediaWiki database, like Lucene (it still looks like they haven't optimized it yet), which can take advantage of things like popularity, categorization and time (vs. a stupid web based crawl). Then mash it with something like a Google, if you prefer Grub fine... But knowing how much time is spent on the index optimization and the fact that the community will carry the day in the end, I think it's a bit pointless. Pointless from the perspective that you can spend all of your money and time there and just get a mediocre index having wasted a lot of money vs. focusing on the community first and dropping in a search provider after the fact. 
In 5 years, with a strong enough community, nobody is going to care who is taking care of the long tail for you; the only thing they will care about is the page rank algorithm that your internal search engine uses, and that's easy enough to give away as it should be open for discussion. As for mahalo, I think they have the UI design for a community powered search engine pretty close to being right. But you are correct, it isn't interesting because it isn't open. I think that's all we are asking you guys to do. Seth On 8/6/07, Jimmy Wales wrote: > > Seth Ford wrote: > > Saw these articles > > http://mashable.com/2007/06/01/10-people-powered-search-engines/ and > > http://sfgate.com/cgi-bin/article.cgi?f=/c/a/2006/09/04/BUGVCKTPOP1.DTL > > I completely agree with Rob Enderle... "You can't build a better Google. > > You have to approach this market differently". Personally I love the way > > > http://mahalo.com/ is tackling the problem... they just need a bit more > > work. But the mashup for people powered wiki content and Google search > > for the tail that doesn't exist is the right one. > > I agree with the general concept of "let humans do what humans do well, > let computers do what computers do well" and I also think this roughly > translates to "the head and the long tail"... > > But what Mahalo is doing is totally uninteresting to me because it is > proprietary. It doesn't change the structure of the industry. > > From jmcc at hackwatch.com Mon Aug 6 15:48:19 2007 From: jmcc at hackwatch.com (John McCormac) Date: Mon, 06 Aug 2007 16:48:19 +0100 Subject: [Search-l] What Is Wikia and How Real Is It? 
In-Reply-To: <46B73565.2020802@wikia.com> References: <46B646C2.4040800@hackwatch.com> <46B71BD6.3080101@hackwatch.com> <46B72154.3080107@wikia.com> <46B73577.5020500@hackwatch.com> <46B73565.2020802@wikia.com> Message-ID: <46B742C3.4000407@hackwatch.com> Jimmy Wales wrote: > I love questions. :) And I love people making a living, nothing wrong > with that. Well in the search business everyone is looking for answers. Any chance of those bandwidth and hardware availability ones? :) Regards...jmcc -- ****************************************************** John McCormac * e-mail: jmcc at whoisireland.com MC2 * voice: +353-51-873640 22 Viewmount * web: http://www.whoisireland.com/ Waterford * blog: http://blog.whoisireland.com Ireland * Irish Domain Stats & Market Research ****************************************************** From chrisdesouza at yahoo.com Mon Aug 6 15:46:03 2007 From: chrisdesouza at yahoo.com (Chris Desouza) Date: Mon, 6 Aug 2007 08:46:03 -0700 (PDT) Subject: [Search-l] What Is Wikia and How Real Is It? In-Reply-To: <46B72154.3080107@wikia.com> Message-ID: <727995.76228.qm@web54107.mail.re2.yahoo.com> ah! the bickering about search dollars aside, i had a momentary lapse in distraction. we all have to keep this in mind - google grew gradually. it picked up speed out of the gate and had time and resources to shore up the tracks. wikia search will not have this advantage where scalability is concerned. with so much media coverage, wikia search must be prepared to handle the search onslaught. a search engine launch is one party where the host cannot afford to run out of food and drinks. google will find its match. and soon enough! chris --- Jimmy Wales wrote: > John McCormac wrote: > > The venture is being portrayed as a Google Killer > in the media coverage > > and spin. The problem is that there is no actual > basis for such a claim > > other than it gives the media a nice soundbite and > keeps the investors > > happy. 
> > Actually, I think it makes the investors wonder what > kind of lunatic I > am. :) > > We are trying to downplay the "google killer" story > line, but it is a > great story line, and so the media runs with it > anyway. > > You would get the same story lines about RedHat and > Microsoft a few > years ago. It's an interesting story, but has > little relationship to > getting some work done. > > > So if I read this right, there is no search > engine? > > There currently is no search engine. This is a > project to build one, > but more importantly, to build this: > > > It is just an idea for a platform that is scalable > and can be used for > > search engine development? But without knowing the > processing > > requirements, the storage requirements and the > bandwidth requirements, > > it is difficult to design such a platform. > > Figuring out those things is part of the process, > yes? > > > The bandwidth required to spider tens of millions > of websites on an > > ongoing basis is considerable. Therefore such a > venture would need a lot > > of available bandwidth. > > > > The hardware is also a very significant > requirement. It would need a lot > > of servers to do a proper crawl of the web. It > would also require a > > backend to process the resulting data into > something usable. And a > > search interface would be required. > > Yes, so that matches my own very scientific > estimates. "a lot of > bandwidth" and a "lot of servers". :) > > > The search index is the hard part. It takes a long > time to develop a > > good, clean index. The Infinite Monkeys approach > to building an index > > (following links and hoping that they will lead to > new pages) is not the > > most efficient method of building an index quickly > when any of the prior > > requirements are absent or deficient. > > I absolutely agree with that. I don't think anyone > is proposing an > Infinite Monkey approach to spidering. 
> > > A good index makes the difference between a great > search engine and a > > spam infested pile of junk. I'm not convinced that > the Wikia people > > quite appreciate the level of work that goes into > that aspect of > > developing a search engine. Crawling a clearly > defined index such as > > that of Wikipedia or some other silo site is easy. > However crawling the > > web is like trying to take a slice of a swirling > nebula. > > Would it help if I say that I *do* appreciate the > level of work that > goes into that aspect of things? Not sure what you > are looking for here. > > The task at the moment for me is to design the > social aspect of the > community part of the site. The goal is to have > good tools to allow the > community to control the crawl in intelligent ways. > This is not > Infinite Monkeys,