From dan at wikia-inc.com Tue Jul 1 19:51:10 2008
From: dan at wikia-inc.com (Dan Lewis)
Date: Tue, 1 Jul 2008 15:51:10 -0400
Subject: [Search-l] Search Team Update: July 1, 2008
Message-ID: <6704a5e60807011251ub497b94m9ea757ec3faf39af@mail.gmail.com>

Here's what the Search team worked on last week:

Nutch 2:
* Continued work on Link Analysis
* Continued work on Link Farm Analysis
* Created WebGraphDb for preprocessing before Link Analysis
* Created WebGraphDb reader
* Improved page scoring tools

Search Tools:
* Created IPBlocking filter for KT
* Added filtering for KT calls, including filtering by has attribute value and by newer timestamp
* Added individual maximums for tuples in KT
* Added request multiple tuples in a single call
* Mocked up and began work on advanced API add
* Worked on on-the-fly language translation
* Worked on toolbar
* Fixed scrolling of new items
* Fixed header bug
* Fixed some weird bug where adblock was blocking search results
* Basic URL checking is in place
* Added more "try also" sites

Operations:
* Enhanced KT performance
* Planning Hbase upgrade to 0.1.3
* Moved overview.mov to Vimeo

Community:
* Cleaned up metrics
* Started work on Wikia Search blog
* Working on tool to allow you to find friends of friends
* Added "Top Users" link on Metrics

From dan at wikia-inc.com Wed Jul 2 17:26:06 2008
From: dan at wikia-inc.com (Dan Lewis)
Date: Wed, 2 Jul 2008 13:26:06 -0400
Subject: [Search-l] Introducing the Wikia Search blog
Message-ID: <6704a5e60807021026i2ccbf09bo250335d658dc895c@mail.gmail.com>

(Crossposted to both Wikia-l and Search-l)

Late last week, we opened the Wikia Search Blog, http://search.wikia.com/blog/ (You may have seen a link in the new navigation menu.)
We're hoping to use it as a vehicle for surfacing the project generally, but also as a place for the community to discuss the project and web search in general. In fact, Jimmy has a post going live in about an hour about why Grub is so important to the future of the Internet.

The blog does not aim to replace the search-l mailing list, so feel free to participate in either or both place(s). There will be some redundancy, of course -- for example, I'm posting the weekly updates in both places.

Also, while the blog necessarily is the "official" voice of Wikia, replete with the responsibilities and headaches thereto, the blog is not meant to be a walled garden. Everything is up for discussion, down to what license to make the content available under and the order of the sidebar. So please shoot over ideas, questions, concerns, etc.

Dan

From dan at wikia-inc.com Tue Jul 8 20:40:59 2008
From: dan at wikia-inc.com (Dan Lewis)
Date: Tue, 8 Jul 2008 16:40:59 -0400
Subject: [Search-l] Search Team Update: July 8, 2008
Message-ID: <6704a5e60807081340p5f96b235rb4971e45f6327059@mail.gmail.com>

Here's what the Search team did last week:

Nutch:
1) Finished LinkRank algorithm
2) Finished LinkLoop identifier tool
3) Finished LinkDumper tool
4) Finished NodeDumper tool
5) Finished NodeReader tool
6) Finished WebGraph Tool
7) Started working on 302 redirect errors in results

This is pretty exciting in and of itself. It means that Nutch, and therefore Wikia Search, now has a stable link analysis algorithm in place that handles reciprocal links, link loops, most link farms and tight-knit communities. The algorithm, therefore, will be able to consider not only the content of a page, but also its incoming links, to determine the relevancy and strength of each page on a keyword-by-keyword basis.
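As a rough illustration of the kind of link analysis described above, here is a minimal Python sketch. It is not Nutch's LinkRank code: it assumes a plain PageRank-style iteration plus a simple rule that discards reciprocal links before scoring. The function names `drop_reciprocal_links` and `link_scores`, the damping factor and the iteration count are all made up for this example.

```python
from collections import defaultdict


def drop_reciprocal_links(edges):
    """Drop A->B whenever B->A also exists, so link swaps earn no credit.

    Self-links (A->A) are their own reverse, so they are dropped too.
    """
    edge_set = set(edges)
    return [(src, dst) for src, dst in edges if (dst, src) not in edge_set]


def link_scores(edges, iterations=20, damping=0.85):
    """PageRank-style power iteration over a list of (source, target) links.

    Assumption: pages with no outlinks simply leak their score; a real
    implementation would redistribute it.
    """
    nodes = {node for edge in edges for node in edge}
    if not nodes:
        return {}
    outlinks = defaultdict(list)
    for src, dst in edges:
        outlinks[src].append(dst)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_scores = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src, targets in outlinks.items():
            # A page splits its damped score evenly among its outlinks.
            share = damping * scores[src] / len(targets)
            for dst in targets:
                new_scores[dst] += share
        scores = new_scores
    return scores
```

Filtering before scoring is only one possible design; loops longer than two pages and link farms would need additional handling that this sketch does not attempt.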
The link analysis suite is not yet deployed -- give it a few days. We'll have more on the link analysis tool later this week. Operations: 1) Put some time in on more Grub backend tools 2) More work on re-indexing and scoring updates -- this is for the link analysis tools 3) Worked with the ISC to try and resolve some networking and storage issues Search Tools: 1) Added support for multiple keywords in a single KT call. 2) Metadata is not returned with all KT calls. 3) Started addition of url table and sort / fetch KT changes by url. 4) Worked on Firefox Toolbar 5) Began work on Advanced URL Add Functionality 6) Worked on spelling suggestions tool Community: 1) Launched Wikia Search blog 2) Began work on People You May Know tool -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080708/0ff209ea/attachment.html From fredbaud at fairpoint.net Thu Jul 10 13:20:49 2008 From: fredbaud at fairpoint.net (Fred Bauder) Date: Thu, 10 Jul 2008 07:20:49 -0600 (MDT) Subject: [Search-l] Yahoo Is Inviting Partners to Build on Its Search Power Message-ID: <50983.66.243.196.131.1215696049.squirrel@webmail.fairpoint.net> Are we looking into this? Do we have an invitation? http://www.nytimes.com/2008/07/10/technology/10yahoo.html How would it work for us? Is Yahoo failing anyway into a takeover by Microsoft? 
Fred From patcito at gmail.com Fri Jul 11 03:07:58 2008 From: patcito at gmail.com (Patrick Aljord) Date: Thu, 10 Jul 2008 22:07:58 -0500 Subject: [Search-l] Introducing the Wikia Search blog In-Reply-To: <6704a5e60807021026i2ccbf09bo250335d658dc895c@mail.gmail.com> References: <6704a5e60807021026i2ccbf09bo250335d658dc895c@mail.gmail.com> Message-ID: <6b6419750807102007j6cf2094cs1c39540eb469a2d8@mail.gmail.com> perma links seem to be dead, example: http://search.wikia.com/blog/2008/07/10/search-stats-june-9-2008/ From jwales at wikia.com Fri Jul 11 14:40:44 2008 From: jwales at wikia.com (Jimmy Wales) Date: Fri, 11 Jul 2008 10:40:44 -0400 Subject: [Search-l] Yahoo Is Inviting Partners to Build on Its Search Power In-Reply-To: <50983.66.243.196.131.1215696049.squirrel@webmail.fairpoint.net> References: <50983.66.243.196.131.1215696049.squirrel@webmail.fairpoint.net> Message-ID: <487770EC.30607@wikia.com> Fred Bauder wrote: > Are we looking into this? Do we have an invitation? > > http://www.nytimes.com/2008/07/10/technology/10yahoo.html > > How would it work for us? Is Yahoo failing anyway into a takeover by > Microsoft? It is interesting how much buzz Yahoo is getting out of something that is so weakly "open". It is not open source at all. It's just an API that they are making available "free as in beer". And how long will it be free? Until it isn't. I can't imagine any significant ecosystem developing around something like this. Of course, Facebook has had great success with widgets on their proprietary platform, but as I understand this (and I have not yet read all the documentation, so please correct me if I am wrong) this just allows third parties to access Yahoo's proprietary search technology... not to access Yahoo's visitors. Facebook applications made sense to third party developers because you can develop a widget and market it through facebook in exchange for (typically) ad revenue from it. 
I can see this being a cute tool for some casual mashup development, and maybe some major content sites will use it to power their internal search. But I don't see it fundamentally affecting anything.

From wikiasari at inbox.org Fri Jul 11 19:39:39 2008
From: wikiasari at inbox.org (Anthony)
Date: Fri, 11 Jul 2008 15:39:39 -0400
Subject: [Search-l] Introducing the Wikia Search blog
In-Reply-To: <6704a5e60807021026i2ccbf09bo250335d658dc895c@mail.gmail.com>
References: <6704a5e60807021026i2ccbf09bo250335d658dc895c@mail.gmail.com>
Message-ID: <71cd4dd90807111239m73457ae3jf01471e0eb18827d@mail.gmail.com>

On Wed, Jul 2, 2008 at 1:26 PM, Dan Lewis wrote:

> In fact, Jimmy has a post going live in
> about an hour about why Grub is so important to the future of the Internet.

Maybe you or someone else on this list can explain this to me... What's the point of having a publicly available, up-to-date, complete copy of the web, if I have to access it over the Internet? I already have access to a publicly available, up-to-date, complete copy of the web, over the Internet. It is the web itself.

I suspect this is mainly just Jimmy and you oversimplifying things here. What all search engines need is an *indexed*, publicly available, up-to-date, complete copy of the web. But to add value a search engine needs much more than that, really.

Say I invented the concept of pagerank, and wanted to add it on to the generic Wikia Search index. Without Grub/Wikia Search, I'd have to crawl the entire web noting links. With Grub/Wikia Search providing me just a copy of the web (without pagerank data, since we're pretending that hasn't been invented yet), I haven't saved much. Sure, I don't have to deal with pipelining http requests to save on latency, but I still have to download the entire web in order to analyze the links.

Now, presumably Grub/Wikia Search will offer me a *filtered* copy of the web so I only have to download the map of links.
For those familiar with Wikipedia dumps, something like pagelinks.sql (for the entire web) would be great. But just a publicly available, up-to-date, complete copy of the web? Not useful at all.

Anthony

From jwales at wikia.com Fri Jul 11 23:10:35 2008
From: jwales at wikia.com (Jimmy Wales)
Date: Fri, 11 Jul 2008 16:10:35 -0700
Subject: [Search-l] Introducing the Wikia Search blog
In-Reply-To: <71cd4dd90807111239m73457ae3jf01471e0eb18827d@mail.gmail.com>
References: <6704a5e60807021026i2ccbf09bo250335d658dc895c@mail.gmail.com> <71cd4dd90807111239m73457ae3jf01471e0eb18827d@mail.gmail.com>
Message-ID: <4877E86B.1040706@wikia.com>

Anthony, the job of the crawler is a lot more complex than you seem to realize. A publicly available, up-to-date, complete copy of the web is nontrivial to do, simply because most of what you get from http requests will not properly be considered part of "the web" due to spider traps, http://en.wikipedia.org/wiki/Spider_trap , etc.

In any event, we provide the index, the algorithm, the data, everything publicly.

--Jimbo

From jeremie at jabber.org Sat Jul 12 17:00:48 2008
From: jeremie at jabber.org (Jeremie Miller)
Date: Sat, 12 Jul 2008 12:00:48 -0500
Subject: [Search-l] Introducing the Wikia Search blog
In-Reply-To: <71cd4dd90807111239m73457ae3jf01471e0eb18827d@mail.gmail.com>
References: <6704a5e60807021026i2ccbf09bo250335d658dc895c@mail.gmail.com> <71cd4dd90807111239m73457ae3jf01471e0eb18827d@mail.gmail.com>
Message-ID: <06C1E132-CAEC-432B-B203-57DC6554DCB5@jabber.org>

> But just a publicly available, up-to-date, complete copy of the web?
> Not useful at all.

It's not just a big blob; the goal is to have various "functional" indexes of it and APIs into it, not a typical ranked keyword index but just the ability to select subsets based on URL, content-type, etc. metadata.
Also, the data is (in various states) loaded into a Hadoop cluster, and contributed MapReduce jobs can be run against it. The only restriction is that the MR jobs are open source and their outputs are available to everyone; this is a community resource.

It's a little early yet, but work is progressing towards these goals for Grub :)

Jer

From dan at wikia-inc.com Wed Jul 16 13:45:28 2008
From: dan at wikia-inc.com (Dan Lewis)
Date: Wed, 16 Jul 2008 09:45:28 -0400
Subject: [Search-l] Search Team Update: July 15, 2008
Message-ID: <6704a5e60807160645q51b750f6ybb677bfbc37fc73@mail.gmail.com>

Here's what the Search Team did last week:

Search Tools:
1) Finished enhanced redirect handling
2) Finished fixing of redirect errors in databases and search results
3) Finished deployment and testing of new search results based on new link rank algorithms.
4) Changed inbound link text in indexing to be untokenized.
5) Started working on new Indexer tool.
6) Started working on new Outlink parsing and analysis tools.
7) Added BOSS results as a supplement to our current results.

The new indexer and outlink parsing tools will allow us to perform analysis on inbound link text and to store and index only relevant text. The new indexer will also allow us to specify per field / value weights for text. These tools, for example, should allow us to weight text such as "Google Homepage" higher than "Hotels" when pointing to google.com, and to avoid Google Bombs such as "Miserable Failure" for a search for Michael Moore, resulting in better, more relevant, less spammy search results.
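To make the anchor-text idea above concrete, here is a minimal Python sketch. It is not the new indexer: it assumes, purely for illustration, that an inbound anchor is "relevant" when its words also appear on the target page. The names `tokenize`, `anchor_weight`, `relevant_anchors` and the 0.5 threshold are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def anchor_weight(anchor_text, page_text):
    """Fraction of anchor terms that also occur on the target page (0.0 to 1.0)."""
    anchor_terms = tokenize(anchor_text)
    if not anchor_terms:
        return 0.0
    return len(anchor_terms & tokenize(page_text)) / len(anchor_terms)


def relevant_anchors(anchors, page_text, threshold=0.5):
    """Keep (anchor, weight) pairs whose weight reaches the threshold."""
    weighted = [(anchor, anchor_weight(anchor, page_text)) for anchor in anchors]
    return [(anchor, weight) for anchor, weight in weighted if weight >= threshold]
```

Under this assumption, an anchor such as "Miserable Failure" pointing at a page that never uses those words gets weight 0.0 and is not indexed for that page, which is the Google Bomb case mentioned above.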
Operations: 1) Handled re-indexing 2) Started a new crawl 3) Made fixes to KT importer, adding the ability to load/populate the new location table 4) Built a new 0.1.3 Hbase cluster, loaded with production data snapshot, populated the new location table, setup tim's new KT code (with new features) pointed to the new cluster with new data (kt.search.isc.org/ktdev/) 5) Tweaked lots of system monitoring Other Tools: 1) Revised email notification text to include real name if available 2) Making great progress on the Wikia Search toolbar! From linasvepstas at gmail.com Thu Jul 17 01:06:16 2008 From: linasvepstas at gmail.com (Linas Vepstas) Date: Wed, 16 Jul 2008 20:06:16 -0500 Subject: [Search-l] Parsed text samples Message-ID: <3ae3aa420807161806r2fc6f0bep886690670da32ff5@mail.gmail.com> Hi, A minor announcement: I've uploaded some pre-parsed files onto http://relex.swlabs.org/~linas/data/ and in particular, to http://relex.swlabs.org/~linas/data/gutenberg/ and http://relex.swlabs.org/~linas/data/voa/ Look for the *.xml.gz files. These files contain text that has been parsed by the link-grammar english-language parser, and marked up with dependency relations relex. The file format is described at http://opencog.org/wiki/RelEx_compact_output (and Relex itself at http://opencog.org/wiki/RelEx) These files are quite large and verbose. You may ask yourself "what the heck is this stuff for?" -- and that's a good question. Start up a new email thread, and I'll happily brainstorm some of the practical and not-so-practical applications. --linas From monge.sergio at gmail.com Thu Jul 17 07:37:24 2008 From: monge.sergio at gmail.com (Sergio Monge) Date: Thu, 17 Jul 2008 09:37:24 +0200 Subject: [Search-l] Parsed text samples In-Reply-To: <3ae3aa420807161806r2fc6f0bep886690670da32ff5@mail.gmail.com> References: <3ae3aa420807161806r2fc6f0bep886690670da32ff5@mail.gmail.com> Message-ID: <487EF6B4.70204@gmail.com> what the heck is this stuff for? 
:-)

Sergio Monge
www.sergiomonge.com

Linas Vepstas wrote:
> Hi,
>
> A minor announcement: I've uploaded some pre-parsed
> files onto
>
> http://relex.swlabs.org/~linas/data/ and in particular, to
> http://relex.swlabs.org/~linas/data/gutenberg/
> and
> http://relex.swlabs.org/~linas/data/voa/
>
> Look for the *.xml.gz files.
>
> These files contain text that has been parsed by the link-grammar
> english-language parser, and marked up with dependency relations
> relex. The file format is described at
>
> http://opencog.org/wiki/RelEx_compact_output
>
> (and Relex itself at http://opencog.org/wiki/RelEx)
>
> These files are quite large and verbose.
>
> You may ask yourself "what the heck is this stuff for?" -- and that's
> a good question. Start up a new email thread, and I'll happily
> brainstorm some of the practical and not-so-practical applications.
>
> --linas
> _______________________________________________
> Wikia Search mailing list
> http://re.search.wikia.com/
> Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l
>
>

From linasvepstas at gmail.com Thu Jul 17 16:10:32 2008
From: linasvepstas at gmail.com (Linas Vepstas)
Date: Thu, 17 Jul 2008 11:10:32 -0500
Subject: [Search-l] Parsed text samples
In-Reply-To: <487EF6B4.70204@gmail.com>
References: <3ae3aa420807161806r2fc6f0bep886690670da32ff5@mail.gmail.com> <487EF6B4.70204@gmail.com>
Message-ID: <3ae3aa420807170910j9d55356jb6fa7920a36e6cc7@mail.gmail.com>

2008/7/17 Sergio Monge :
> what the heck is this stuff for?

Heh. Well, to improve search, of course! The idea is that by having lexical, semantic information, the quality of search results can be improved. It also should allow NLP queries:

"Who won the 1957 World Series?"

So, relex output identifies subject and object relations. In this case, "who" is the subject, "the 1957 world series" is the object, and "win" is the verb. So, we are looking for any text which has "win" as the verb, and "world series" as the object.
Find that, and you've found the answer.

I think the above could actually be fairly simple/straight-forward to implement: you have to make a giant table of subject, object, and URL. When a question is typed in, you search the table for matching subject/object.

Whether this is better than keyword search, I dunno. Maybe just some of the time. But I think you can fold the scores in with keyword scores, and get better results.

If we can get even basics like the above working on a large scale, then there are much fancier things that can be done.

Besides, this is all supposed to be sexy/hot: Microsoft just paid $100M for Powerset, and, as best as I can tell, Powerset doesn't do much more than the above. There's a couple of other startups playing in this area 'cause it's, uhh, sexy hot. So, if nothing else, it allows wikia to claim it's in the forefront with the latest "semantic web" technologies.

--linas

From jeremie at jabber.org Thu Jul 17 17:59:25 2008
From: jeremie at jabber.org (Jeremie Miller)
Date: Thu, 17 Jul 2008 12:59:25 -0500
Subject: [Search-l] Parsed text samples
In-Reply-To: <3ae3aa420807170910j9d55356jb6fa7920a36e6cc7@mail.gmail.com>
References: <3ae3aa420807161806r2fc6f0bep886690670da32ff5@mail.gmail.com> <487EF6B4.70204@gmail.com> <3ae3aa420807170910j9d55356jb6fa7920a36e6cc7@mail.gmail.com>
Message-ID: <7BFD4FB2-7762-4648-AB02-B11B90112D6A@jabber.org>

It's definitely sexy hot, thanks Linas (and RelEx folks)!

I'm looking forward to deploying this both in our map-reduce cluster on a large set of the top pages in the index (and posting the resulting data of course), as well as figuring out how we could better integrate this with Grub as a platform. I think the promise here of rich tagged content is very, very exciting :)

Jer

On Jul 17, 2008, at 11:10 AM, Linas Vepstas wrote:

> 2008/7/17 Sergio Monge :
>> what the heck is this stuff for?
>
> Heh. Well, to improve search, of course!
The idea is that > by having lexical, semantic information, the quality of > search results can be improved. It also should allow > NLP queries: > > "Who won the 1957 World Series?" > > So, relex output identifies subject and object relations. > In this case, "who" is the subject, "the 1957 world series" > is the object, and "win" is the verb. So, we are looking > for any text which has "win" as the verb, and "world series" > as the object. Find that, and you've found the answer. > > I think the above could actually be fairly simple/straight-forward > to implement: you have to make a giant table of subject, > object, and URL. When a question is typed in, you search > the table for matching subject/object. > > Whether this is better than keyword search, I dunno. Maybe > just some of the time. But I think you can fold the scores > in with keyword scores, and get better results. > > If we can get even basics like the above working on a large > scale, then there are much fancier things that can be done. > > Besides, this is all supposed to be sexy/hot: Microsoft just > paid $100M for Powerset, and, as best as I can tell, Powerset > doesn't do much more than the above. There's a couple of > other startups playing in this area cause its, uhh sexy hot. > So, if nothing else, it allows wikia to claim its in the forefront > with the latest "semantic web" technologies. 
> > --linas > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l > From linasvepstas at gmail.com Thu Jul 17 22:06:38 2008 From: linasvepstas at gmail.com (Linas Vepstas) Date: Thu, 17 Jul 2008 17:06:38 -0500 Subject: [Search-l] Parsed text samples In-Reply-To: <7BFD4FB2-7762-4648-AB02-B11B90112D6A@jabber.org> References: <3ae3aa420807161806r2fc6f0bep886690670da32ff5@mail.gmail.com> <487EF6B4.70204@gmail.com> <3ae3aa420807170910j9d55356jb6fa7920a36e6cc7@mail.gmail.com> <7BFD4FB2-7762-4648-AB02-B11B90112D6A@jabber.org> Message-ID: <3ae3aa420807171506j59078d27v63543f4e240f05f@mail.gmail.com> 2008/7/17 Jeremie Miller : > It's definitely sexy hot, thanks Linas (and RelEx folks)! > > I'm looking forward to deploying this both in our map-reduce cluster on a > large set of the top pages in the index (and posting the resulting data of > course), as well as figuring out how we could better integrate this with > Grub as a platform, I think the promise here to have rich tagged content is > very very exciting :) Well, someone still has to do the actual work of wiring all of this stuff up. The example I gave is a tip-of-the-iceberg, just the simplest example, an example where a query can be performed at high speed, using more-or-less the existing query mechanisms. There are certainly much fancier things one can do (and these are the things I'm working on, but for general use, not just search.) So we still need someone to stand up and say "gee, I understood that last bit, I'll hook the data in, and see how it works" -- measure and experiment, performance tune (I know of some weak spots), and if it all works, try out some of the fancier ideas. I can brainstorm and guide and provide advice but at least right now, I can't do the actual search-engine part of the work. 
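For anyone who wants to pick this up, the simplest example from earlier in this thread (a big table of subject, object and URL, searched when a question comes in) might be sketched roughly as follows. This is only an illustration in Python, not RelEx's output format or any existing Wikia code: the `RelationIndex` class, the extra verb column and the example.org URL are assumptions made for the sketch.

```python
from collections import defaultdict


class RelationIndex:
    """In-memory table of (subject, verb, object, URL) rows, keyed by verb and object."""

    def __init__(self):
        self._rows = defaultdict(list)

    def add(self, subject, verb, obj, url):
        """Store one parsed relation found on the page at `url`."""
        self._rows[(verb.lower(), obj.lower())].append((subject, url))

    def answer(self, verb, obj):
        """Return (subject, url) pairs for rows matching the verb and object."""
        return list(self._rows.get((verb.lower(), obj.lower()), []))


index = RelationIndex()
# Hypothetical row, as if extracted from a parsed sentence on some page.
index.add("Milwaukee Braves", "win", "1957 World Series", "http://example.org/1957")
# "Who won the 1957 World Series?" becomes a lookup on verb "win" and that object.
matches = index.answer("win", "1957 world series")
```

Exact matching on the object string is the weakest part of the sketch; a usable version would need some normalization (for example, matching "world series" against "the 1957 World Series") and a way to fold the match into the keyword score.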
I am very much interested in the results, because at least a part of what I need to accomplish requires a fast search-like ability to find related concepts, and this is the first step on that road. --linas (Linas wonders what Rich Jones plans to do for the rest of gsoc...) From newsmarkie at googlemail.com Fri Jul 18 19:35:15 2008 From: newsmarkie at googlemail.com (Mark (Markie)) Date: Fri, 18 Jul 2008 20:35:15 +0100 Subject: [Search-l] Widgets Message-ID: I know i havent been to active round here recently (been really busy, sorry :-( ) but one thing that i have noticed on the mailing lists that really interested me is the widgets ideas. IMO these are really key things that, shouldnt :-p, be too hard to implement. Ive had a brief look at one of the widgets which i look for alot, which is weather (although i dont know why, it always seems to be rain :-p) NWS / NOAA provide an XML output of weather for all US cities and most international cities, and with it being a US gov org all its content is therefore PD and totally free to use and copy etc (not a lawyer though). this we could then implement easily using a JS call from the search var, with or without looking for a "weather: XXX" keyword and then output next to the search results, maybe below the adverts. this could also be JS implemented to be hide able etc with more widgets....... the problem with NOAA IMO is that it requires calls to use lat/long co-ordinates for the calls, thus required more geo-location stuff which makes this harder and harder. so then i looked for alternative APIs which could be used and came up with weather.com's feed. this requires registration, but is then free to use but requires 5 links back and a logo, which may or maynot be acceptable to us?? this however does give the ease of being able to call by city name, removing the need for geo-locations stuff. however the balance needs to come between coding and technical possibility and the "freeness" of the content. 
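A rough sketch of the "weather: XXX" keyword detection and the lat/long lookup discussed above, written in Python only to show the logic (the widget itself would presumably be JS). Everything here is an assumption for illustration: the `WEATHER_QUERY` pattern, the function names and the tiny `CITY_COORDS` table with approximate coordinates. No real weather or geocoding API is called.

```python
import re

# Matches queries of the form "weather: <place>", ignoring case and extra spaces.
WEATHER_QUERY = re.compile(r"^\s*weather\s*:\s*(?P<place>.+?)\s*$", re.IGNORECASE)

# Hypothetical lookup table with approximate coordinates; a real widget
# would need a proper geocoding source instead.
CITY_COORDS = {
    "seattle": (47.61, -122.33),
    "london": (51.51, -0.13),
}


def parse_weather_query(query):
    """Return the place name from a "weather: XXX" query, or None."""
    match = WEATHER_QUERY.match(query)
    return match.group("place") if match else None


def coords_for_query(query, coords=CITY_COORDS):
    """Return (lat, long) for a weather query, or None if unknown or not a weather query."""
    place = parse_weather_query(query)
    if place is None:
        return None
    return coords.get(place.lower())
```

A feed that accepts city names would only need `parse_weather_query`; a lat/long-based feed would need the second step as well, which is where the extra geo-location work comes in.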
also im guessing there are many different ideas for widgets so also please feel free to give thoughts on this one, and any others that spring to mind as being wanted. my thoughts were: Open Street Map contents when searching locations (ie a map of the location in the world/country/are when searching for a city) or cinema times for jimbo etc etc regards (and apologies for the lengthy message) mark -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080718/b16c0d24/attachment.html From linasvepstas at gmail.com Fri Jul 18 21:05:57 2008 From: linasvepstas at gmail.com (Linas Vepstas) Date: Fri, 18 Jul 2008 16:05:57 -0500 Subject: [Search-l] Widgets In-Reply-To: References: Message-ID: <3ae3aa420807181405g353886bek21f94904a089b1cc@mail.gmail.com> 2008/7/18 Mark (Markie) : > interested me is the widgets ideas. IMO these are really key things that, Maybe this has been talked to death already, but I started using yahoo search recently, and I really like the keyword-refinement widget they have -- e.g. if you search for perl, it offers up a bunch of perl topics to pick from. I'm guessing that picking from this list improves both accuracy and speed. --linas From aerik at thesylvans.com Sat Jul 19 04:31:47 2008 From: aerik at thesylvans.com (Aerik Sylvan) Date: Fri, 18 Jul 2008 21:31:47 -0700 Subject: [Search-l] Widgets In-Reply-To: References: Message-ID: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> On Fri, Jul 18, 2008 at 12:35 PM, Mark (Markie) wrote: > I know i havent been to active round here recently (been really busy, sorry > :-( ) but one thing that i have noticed on the mailing lists that really > interested me is the widgets ideas. IMO these are really key things that, > shouldnt :-p, be too hard to implement. 
Ive had a brief look at one of the > widgets which i look for alot, which is weather (although i dont know why, > it always seems to be rain :-p) > > NWS / NOAA provide an XML output of weather for all US cities and most > international cities, and with it being a US gov org all its content is > therefore PD and totally free to use and copy etc (not a lawyer though). > this we could then implement easily using a JS call from the search var, > with or without looking for a "weather: XXX" keyword and then output next to > the search results, maybe below the adverts. this could also be JS > implemented to be hide able etc with more widgets....... the problem with > NOAA IMO is that it requires calls to use lat/long co-ordinates for the > calls, thus required more geo-location stuff which makes this harder and > harder. > Hmm... couldn't you use a Google Maps API call (JS) to get the lat lang, then call the NOAA feed? I haven't looked at the NOAA feed at all, but... Then we cache the lat/lang in a cookie or something, to avoid a gazillion calls. Might be a little tricky with pure javascript, if the NOAA feed is only xml and not JSON, but... Aerik -- http://www.wikidweb.com - the Wiki Directory of the Web http://tagthis.info - Hosted Tagging for your website! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080718/d277f4f8/attachment.html From ger.dupont at gmail.com Sun Jul 20 15:39:29 2008 From: ger.dupont at gmail.com (=?ISO-8859-1?Q?G=E9rard_Dupont?=) Date: Sun, 20 Jul 2008 17:39:29 +0200 Subject: [Search-l] advanced search In-Reply-To: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> Message-ID: <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com> Another dummy questions : is there any plan concerning multimedia ? 
I mean, image search should at least be offered, and later video too. I will happily contribute on that part.

gdupont

2008/6/30 Gérard Dupont :
> Hi all,
>
> First of all, I must say that I recently went back to search wikia to test
> the relevance of its results after a few months. Well, I must say that I'm
> quite impressed by the results. That's really good ! Ok, it still suffers on
> very recent topics and google beats it, but IMHO for stable themes, the
> results are really good.
>
> I really like the idea that I can interact directly with the results and
> for 2/3 queries, I actually saw the change when I launched (accidentally) the
> same query again after some days. I don't really see the benefit of the
> social network aspect but it might be useful later.
>
> Now, I have a simple basic question. I don't think this is the right place
> to post this question, but can't find a better one. So where is the advanced
> search part where I can use specific query syntax (site or domain
> restriction is one of my favourites) ? Do you process the query in a special
> way or is the lucene syntax available ?
>
> Hope that someone has the answer.
>
> cheers
>
> --
> Gérard Dupont
> Information Processing Competence Center (IPCC) - EADS DS
> http://weblab-project.org
>
> Perception & Machine Learning team - LITIS Laboratory

--
Gérard Dupont
Information Processing Competence Center (IPCC) - EADS DS
http://weblab-project.org

Perception & Machine Learning team - LITIS Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20080720/2611c696/attachment.html

From patcito at gmail.com Sun Jul 20 22:07:06 2008
From: patcito at gmail.com (Patrick Aljord)
Date: Sun, 20 Jul 2008 17:07:06 -0500
Subject: [Search-l] advanced search
In-Reply-To: <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com>
References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com>
Message-ID: <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com>

On Sun, Jul 20, 2008 at 10:39 AM, Gérard Dupont wrote:
> Another dummy questions : is there any plan concerning multimedia ? I mean
> searching for images should be at least proposed and later video too. I will
> happily contribute on that part.
>

and searching mailing list and latest news too would be cool :)

From marcnaweb at gmail.com Mon Jul 21 03:25:22 2008
From: marcnaweb at gmail.com (Marc .)
Date: Mon, 21 Jul 2008 00:25:22 -0300
Subject: [Search-l] advanced search
In-Reply-To: <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com>
References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com> <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com>
Message-ID: <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com>

I personally think that image search could be a major difference between Wikia Search and the other big search engines (like Google, Yahoo, etc.). The big engines are really "bad" at image search.

I think that, for a "simple" start, someone looking for an image of someone else on the net could upload an image with a face and the SE would look on the net for other similar images (yes, there is tech for it). Take a look at http://www.face-rec.org/ ; they have a lot of resources that could be applied to it (and no, I don't have any relation with face-rec.org).
BR
Marc Rosenfeld

2008/7/20 Patrick Aljord :
> On Sun, Jul 20, 2008 at 10:39 AM, Gérard Dupont wrote:
> > Another dummy questions : is there any plan concerning multimedia ? I mean
> > searching for images should be at least proposed and later video too. I will
> > happily contribute on that part.
> >
>
> and searching mailing list and latest news too would be cool :)

From borboleta at gmail.com Mon Jul 21 03:44:30 2008
From: borboleta at gmail.com (Bani)
Date: Mon, 21 Jul 2008 00:44:30 -0300
Subject: [Search-l] advanced search
In-Reply-To: <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com>
References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com> <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com> <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com>
Message-ID: <35b94d690807202044h378f04e2s7446e4bcb3425fd6@mail.gmail.com>

I think the competitive advantage Wikia is trying to achieve in Wikia Search is the social aspect of it, so things like face recognition algorithms and other stuff that require intensive computation shouldn't be the priority. But Wikia could use the fact that it already has a community of people willing to collaborate to try more "human based computation", such as the ESP Game (http://www.gwap.com/gwap/gamesPreview/espgame/), later adopted by Google as an Image Labeler (http://images.google.com/imagelabeler/).

On Mon, Jul 21, 2008 at 12:25 AM, Marc .
wrote: > I personally think that image search could be a major difference between > Wikia Search and other big Search Engine (like google, yahoo etc) > The big engine are really "bad" when proposing an image search. > I think that, for a "simple" start, someone looking for a image of someone > else in the net could upload an image with a face and the SE look in the net > for other similar images (yes there is tech for it). Get a look in > http://www.face-rec.org/ they have a lot of resources that could be applied > for it (and no, i don't have any relations with face-reg.org). > > BR > Marc Rosenfeld > > > 2008/7/20 Patrick Aljord : >> >> On Sun, Jul 20, 2008 at 10:39 AM, G?rard Dupont >> wrote: >> > Another dummy questions : is there any plan concerning multimedia ? I >> > mean >> > searching for images should be at least proposed and later video too. I >> > will >> > happily contribute on that part. >> > >> >> and searching mailing list and latest news too would be cool :) >> _______________________________________________ >> Wikia Search mailing list >> http://re.search.wikia.com/ >> Change options or unsubscribe: >> http://lists.wikia.com/mailman/options/search-l > > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > From patcito at gmail.com Mon Jul 21 04:59:02 2008 From: patcito at gmail.com (Patrick Aljord) Date: Sun, 20 Jul 2008 23:59:02 -0500 Subject: [Search-l] advanced search In-Reply-To: <35b94d690807202044h378f04e2s7446e4bcb3425fd6@mail.gmail.com> References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com> <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com> <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com> <35b94d690807202044h378f04e2s7446e4bcb3425fd6@mail.gmail.com> Message-ID: 
<6b6419750807202159u742d44b5o33b06503adf8c055@mail.gmail.com> On Sun, Jul 20, 2008 at 10:44 PM, Bani wrote: > I think the competitive advantage Wikia is trying to achieve in Wikia > Search is the social aspect of it, so things like face recognition > algorithms and other stuff that requires intensive computation > shouldn't be the priority. Wikia also takes advantage of Grub to distribute intensive computation, so high computation shouldn't be a problem. From marcnaweb at gmail.com Mon Jul 21 05:43:46 2008 From: marcnaweb at gmail.com (Marc .) Date: Mon, 21 Jul 2008 02:43:46 -0300 Subject: [Search-l] advanced search In-Reply-To: <6b6419750807202159u742d44b5o33b06503adf8c055@mail.gmail.com> References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com> <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com> <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com> <35b94d690807202044h378f04e2s7446e4bcb3425fd6@mail.gmail.com> <6b6419750807202159u742d44b5o33b06503adf8c055@mail.gmail.com> Message-ID: <5f2640d0807202243q7cd208b8x987e53e7eb22890@mail.gmail.com> I think that Patrick is correct: Wikia Search shouldn't restrict itself "only" in a social search engine: with Grub it could be a SE based in intensive computation too. And personally, I think that people would like to "give away" a part of their computer capacity to allow a system that makes Wikia Seach distinctive from other usual Search Engine, and would Only be possible in Wikia Search --for the moment I don't see any other SE that offer this kind of tool. BR Marc Rosenfeld 2008/7/21 Patrick Aljord : > On Sun, Jul 20, 2008 at 10:44 PM, Bani wrote: > > I think the competitive advantage Wikia is trying to achieve in Wikia > > Search is the social aspect of it, so things like face recognition > > algorithms and other stuff that requires intensive computation > > shouldn't be the priority. 
> > Wikia also takes advantage of Grub to distribute intensive > computation, so high computation shouldn't be a problem. > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080721/d0487afb/attachment.html From newsmarkie at googlemail.com Mon Jul 21 14:55:07 2008 From: newsmarkie at googlemail.com (Mark (Markie)) Date: Mon, 21 Jul 2008 15:55:07 +0100 Subject: [Search-l] Widgets In-Reply-To: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> Message-ID: hmmm we could do but then we would be using more google resources (evil) but would then get free results (good), so we have to balance that up. i know there was some open source kit somewhere, that did geolocation by ip and JS calls etc, but i cant remember where/what it was. will have to see if i can dig it up again regards mark On Sat, Jul 19, 2008 at 5:31 AM, Aerik Sylvan wrote: > > > On Fri, Jul 18, 2008 at 12:35 PM, Mark (Markie) > wrote: > >> I know i havent been to active round here recently (been really busy, >> sorry :-( ) but one thing that i have noticed on the mailing lists that >> really interested me is the widgets ideas. IMO these are really key things >> that, shouldnt :-p, be too hard to implement. Ive had a brief look at one of >> the widgets which i look for alot, which is weather (although i dont know >> why, it always seems to be rain :-p) >> >> NWS / NOAA provide an XML output of weather for all US cities and most >> international cities, and with it being a US gov org all its content is >> therefore PD and totally free to use and copy etc (not a lawyer though). 
>> this we could then implement easily using a JS call from the search var, >> with or without looking for a "weather: XXX" keyword and then output next to >> the search results, maybe below the adverts. this could also be JS >> implemented to be hide able etc with more widgets....... the problem with >> NOAA IMO is that it requires calls to use lat/long co-ordinates for the >> calls, thus required more geo-location stuff which makes this harder and >> harder. >> > > Hmm... couldn't you use a Google Maps API call (JS) to get the lat lang, > then call the NOAA feed? I haven't looked at the NOAA feed at all, but... > Then we cache the lat/lang in a cookie or something, to avoid a gazillion > calls. Might be a little tricky with pure javascript, if the NOAA feed is > only xml and not JSON, but... > > Aerik > > -- > http://www.wikidweb.com - the Wiki Directory of the Web > http://tagthis.info - Hosted Tagging for your website! > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080721/543b289c/attachment.html From newsmarkie at googlemail.com Mon Jul 21 21:34:35 2008 From: newsmarkie at googlemail.com (Mark (Markie)) Date: Mon, 21 Jul 2008 22:34:35 +0100 Subject: [Search-l] Counts Message-ID: 1,980,760 queries1,980,760 queries1,980,767 queries1,980,773 queries1,980,765 queries1,980,760 queries 766,886 contributions Im sure the counts were much more than this, or was it me? I thought we were somewhere around the 3,XXX,XXX mark? regards mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.wikia.com/pipermail/search-l/attachments/20080721/22f5e2c5/attachment.html From dan at wikia-inc.com Mon Jul 21 21:41:37 2008 From: dan at wikia-inc.com (Dan Lewis) Date: Mon, 21 Jul 2008 17:41:37 -0400 Subject: [Search-l] Counts In-Reply-To: References: Message-ID: <6704a5e60807211441u494fc6f7o3fa06da74e11ea9a@mail.gmail.com> It wasn't you. It was us. A caching problem caused the counter to tick upward at a really fast rate, and we fixed that bug -- and the counter with it. The true count is the one you see now. Sorry :( On Mon, Jul 21, 2008 at 5:34 PM, Mark (Markie) wrote: > 1,980,760 queries1,980,760 queries1,980,767 queries1,980,773 queries1,980,765 > queries1,980,760 queries > 766,886 contributions > > Im sure the counts were much more than this, or was it me? I thought we > were somewhere around the 3,XXX,XXX mark? > > regards > > mark > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -- The message is intended only for the use of the individual(s) and/or entity to which it is addressed and may contain information that is privileged, confidential, and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this email is strictly prohibited. If you have received this email in error, please notify me by reply email and delete the original message. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.wikia.com/pipermail/search-l/attachments/20080721/3ce1591a/attachment.html From newsmarkie at googlemail.com Mon Jul 21 21:44:52 2008 From: newsmarkie at googlemail.com (Mark (Markie)) Date: Mon, 21 Jul 2008 22:44:52 +0100 Subject: [Search-l] Counts In-Reply-To: <6704a5e60807211441u494fc6f7o3fa06da74e11ea9a@mail.gmail.com> References: <6704a5e60807211441u494fc6f7o3fa06da74e11ea9a@mail.gmail.com> Message-ID: ahh man, now i gotta change my bet :-( cheers mark On Mon, Jul 21, 2008 at 10:41 PM, Dan Lewis wrote: > It wasn't you. It was us. A caching problem caused the counter to tick > upward at a really fast rate, and we fixed that bug -- and the counter with > it. The true count is the one you see now. Sorry :( > > On Mon, Jul 21, 2008 at 5:34 PM, Mark (Markie) > wrote: > >> 1,980,760 queries1,980,760 queries1,980,767 queries1,980,773 queries1,980,765 >> queries1,980,760 queries >> 766,886 contributions >> >> Im sure the counts were much more than this, or was it me? I thought we >> were somewhere around the 3,XXX,XXX mark? >> >> regards >> >> mark >> >> _______________________________________________ >> Wikia Search mailing list >> http://re.search.wikia.com/ >> Change options or unsubscribe: >> http://lists.wikia.com/mailman/options/search-l >> > > > > -- > The message is intended only for the use of the individual(s) and/or entity > to which it is addressed and may contain information that is privileged, > confidential, and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any dissemination, distribution, or copying of this email is strictly > prohibited. If you have received this email in error, please notify me by > reply email and delete the original message. Thank you. 
> > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080721/10d4cc49/attachment.html From jwales at wikia.com Mon Jul 21 16:08:38 2008 From: jwales at wikia.com (Jimmy Wales) Date: Mon, 21 Jul 2008 12:08:38 -0400 Subject: [Search-l] Widgets In-Reply-To: References: Message-ID: <4884B486.5040405@wikia.com> I am 100% focussed on widgets as the "next big thing" for Wikia search. We have a lot of bugfixes and refinements going on... and the toolbar is starting to look pretty amazing too. But I share Markie's view that the widget framework is going to be a killer app. The basic concept, I think I need help with some technical details, most of which will have to come from internal people like Jer. But I also have some general thoughts and questions which I think a lot of people will be able to help with. Basically, what I am wanting to see is a very very simple api that will let boring crappy programmers like me come up with some neat ideas and actually code them up quickly. The idea is: you match on certain regular expressions (though for performance reasons we probably can not support full regex) and return an "object" which might be a specialized search box, a grid of data, whatever. For example: '92109 weather' 'weather 92109' 'san diego weather' 'weather san diego' should return... something, the best and most free thing we can return. And 'SFO to LAX' should return a specialized search object, like what google does... The idea is this: 1. Ordinary programmers can upload triggers (either lists of keywords or... ??? how broad and computational can we make this?) 2. And those triggers call subroutines which return "objects" 3. 
And the community/staff decide which of these are returned "by default" for all users... but anyone can also go into advanced preferences and add the widgets experimentally that we collectively think are "not ready for prime time". --Jimbo From jwales at wikia.com Mon Jul 21 16:10:35 2008 From: jwales at wikia.com (Jimmy Wales) Date: Mon, 21 Jul 2008 12:10:35 -0400 Subject: [Search-l] Widgets In-Reply-To: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> Message-ID: <4884B4FB.4030508@wikia.com> Aerik Sylvan wrote: > Hmm... couldn't you use a Google Maps API call (JS) to get the lat lang, > then call the NOAA feed? I haven't looked at the NOAA feed at all, > but... Then we cache the lat/lang in a cookie or something, to avoid a > gazillion calls. Might be a little tricky with pure javascript, if the > NOAA feed is only xml and not JSON, but... I am 100% all for this, and I am also a free software and free data fanatic, and would love to see a way to make all of this available under a free license. So calling external APIs to proprietary servcies to make life better is great -- we are doing this now, calling Yahoo BOSS when we have no results -- but it is only great as long as we keep firmly focussed on: how can we also replace this with something really open and free. From jwales at wikia.com Mon Jul 21 16:11:16 2008 From: jwales at wikia.com (Jimmy Wales) Date: Mon, 21 Jul 2008 12:11:16 -0400 Subject: [Search-l] Widgets In-Reply-To: <3ae3aa420807181405g353886bek21f94904a089b1cc@mail.gmail.com> References: <3ae3aa420807181405g353886bek21f94904a089b1cc@mail.gmail.com> Message-ID: <4884B524.80208@wikia.com> Linas Vepstas wrote: > 2008/7/18 Mark (Markie) : > >> interested me is the widgets ideas. 
IMO these are really key things that, > > Maybe this has been talked to death already, but I started using > yahoo search recently, and I really like the keyword-refinement > widget they have -- e.g. if you search for perl, it offers up a bunch > of perl topics to pick from. I'm guessing that picking from this > list improves both accuracy and speed. Can you send us some links to sample searches to look at this? How hard would it be to replicate this functionality in free software? From jwales at wikia.com Mon Jul 21 16:24:31 2008 From: jwales at wikia.com (Jimmy Wales) Date: Mon, 21 Jul 2008 12:24:31 -0400 Subject: [Search-l] advanced search In-Reply-To: <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com> References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com> <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com> <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com> Message-ID: <4884B83F.1070502@wikia.com> Marc . wrote: > I personally think that image search could be a major difference between > Wikia Search and other big Search Engine (like google, yahoo etc) > The big engine are really "bad" when proposing an image search. > I think that, for a "simple" start, someone looking for a image of > someone else in the net could upload an image with a face and the > SE look in the net for other similar images (yes there is tech for it). > Get a look in http://www.face-rec.org/ they have a lot of resources that > could be applied for it (and no, i don't have any relations with > face-reg.org ). I am on a plane and can't look right now, but is their work free software? 
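The trigger idea from Jimmy's "Widgets" message of Jul 21 (match a query against patterns, call a subroutine, get back an "object") could look roughly like the sketch below. It is only an illustration under stated assumptions: the names Widget, trigger and match_widget are hypothetical and are not part of any existing Wikia Search API, and it uses full Python regular expressions for brevity even though the message notes that full regex may be too slow in production.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Widget:
    """Hypothetical 'object' returned to the results page."""
    kind: str
    params: dict


# Registry of (compiled pattern, handler) pairs; all names are illustrative.
_TRIGGERS: list = []


def trigger(pattern: str):
    """Decorator that registers a handler for queries matching `pattern`."""
    def register(handler):
        _TRIGGERS.append((re.compile(pattern, re.IGNORECASE), handler))
        return handler
    return register


# Covers "weather 92109", "92109 weather", "weather san diego", "san diego weather".
@trigger(r"^(?:weather\s+(?P<a>.+)|(?P<b>.+?)\s+weather)$")
def weather(match):
    place = match.group("a") or match.group("b")
    return Widget("weather", {"place": place.strip()})


# Covers queries such as "SFO to LAX" (two three-letter codes).
@trigger(r"^(?P<src>[A-Za-z]{3})\s+to\s+(?P<dst>[A-Za-z]{3})$")
def flights(match):
    return Widget("flight_search", {"from": match.group("src").upper(),
                                    "to": match.group("dst").upper()})


def match_widget(query: str) -> Optional[Widget]:
    """Return the first widget whose trigger matches, or None for a plain search."""
    q = query.strip()
    for pattern, handler in _TRIGGERS:
        m = pattern.match(q)
        if m:
            return handler(m)
    return None
```

In this sketch, which widgets are on "by default" and which are experimental would simply be a filter over the registry, decided by the community/staff or by each user's preferences.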
From ger.dupont at gmail.com Tue Jul 22 12:59:05 2008 From: ger.dupont at gmail.com (Gérard Dupont) Date: Tue, 22 Jul 2008 14:59:05 +0200 Subject: [Search-l] Widgets In-Reply-To: <4884B524.80208@wikia.com> References: <3ae3aa420807181405g353886bek21f94904a089b1cc@mail.gmail.com> <4884B524.80208@wikia.com> Message-ID: <471965e10807220559h7b579e63o852c1de8f92546f7@mail.gmail.com> Depending on the level of complexity you want in the suggestions, it could be anywhere from very simple to quite hard. You could simply use dictionaries. A better approach is to use semantic sources to find close terms in the semantic space (quite a bit harder, I think). I think that what Yahoo does is global relevance feedback, i.e. proposing terms and queries that are close to the one you entered and that other users have already submitted (collaborative relevance feedback or so). You could also do relevance feedback at the user level, matching a user's interests against their past queries and clicks (if you store such data). 2008/7/21 Jimmy Wales : > Linas Vepstas wrote: > > 2008/7/18 Mark (Markie) : > > > >> interested me is the widgets ideas. IMO these are really key things > that, > > > > Maybe this has been talked to death already, but I started using > > yahoo search recently, and I really like the keyword-refinement > > widget they have -- e.g. if you search for perl, it offers up a bunch > > of perl topics to pick from. I'm guessing that picking from this > > list improves both accuracy and speed. > > Can you send us some links to sample searches to look at this? > > How hard would it be to replicate this functionality in free software?
-- Gérard Dupont Information Processing Competence Center (IPCC) - EADS DS http://weblab-project.org Perception & Machine Learning team - LITIS Laboratory From ger.dupont at gmail.com Tue Jul 22 13:06:21 2008 From: ger.dupont at gmail.com (Gérard Dupont) Date: Tue, 22 Jul 2008 15:06:21 +0200 Subject: [Search-l] advanced search In-Reply-To: <4884B83F.1070502@wikia.com> References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com> <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com> <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com> <4884B83F.1070502@wikia.com> Message-ID: <471965e10807220606y6955665fybfd347bf3b988228@mail.gmail.com> I'm not sure that face recognition is the right feature to add for now. The first step is to add content-based indexing for media (images and more); then, using some tricks, face recognition (and more generally object recognition) could be used. There are plenty of ways to implement image indexing (lots of feature extraction components exist, as do Lucene adaptations for image indexing). 2008/7/21 Jimmy Wales : > Marc . wrote: > > I personally think that image search could be a major difference between > > Wikia Search and other big Search Engine (like google, yahoo etc) > > The big engine are really "bad" when proposing an image search. > > I think that, for a "simple" start, someone looking for a image of > > someone else in the net could upload an image with a face and the > > SE look in the net for other similar images (yes there is tech for it).
> > Get a look in http://www.face-rec.org/ they have a lot of resources that > > could be applied for it (and no, i don't have any relations with > > face-reg.org ). > > I am on a plane and can't look right now, but is their work free software? > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -- G?rard Dupont Information Processing Competence Center (IPCC) - EADS DS http://weblab-project.org Perception & Machine Learning team - LITIS Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080722/4a534844/attachment.html From newsmarkie at googlemail.com Tue Jul 22 13:47:47 2008 From: newsmarkie at googlemail.com (Mark (Markie)) Date: Tue, 22 Jul 2008 14:47:47 +0100 Subject: [Search-l] Widgets In-Reply-To: <4884B4FB.4030508@wikia.com> References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> <4884B4FB.4030508@wikia.com> Message-ID: hmm this is the problem i kinda outlined above, as to what amount do we accept the use of unfree services that may use info from our users in the "wrong" way, such as google, yahoo etc. i agree that we should try at all costs to stay free, but to what limits will we accept the use of apis etc from proprietaries? regards mark On Mon, Jul 21, 2008 at 5:10 PM, Jimmy Wales wrote: > Aerik Sylvan wrote: > > Hmm... couldn't you use a Google Maps API call (JS) to get the lat lang, > > then call the NOAA feed? I haven't looked at the NOAA feed at all, > > but... Then we cache the lat/lang in a cookie or something, to avoid a > > gazillion calls. Might be a little tricky with pure javascript, if the > > NOAA feed is only xml and not JSON, but... 
> > I am 100% all for this, and I am also a free software and free data > fanatic, and would love to see a way to make all of this available under > a free license. > > So calling external APIs to proprietary servcies to make life better is > great -- we are doing this now, calling Yahoo BOSS when we have no > results -- but it is only great as long as we keep firmly focussed on: > how can we also replace this with something really open and free. > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080722/35d7b0c4/attachment.html From newsmarkie at googlemail.com Tue Jul 22 13:55:05 2008 From: newsmarkie at googlemail.com (Mark (Markie)) Date: Tue, 22 Jul 2008 14:55:05 +0100 Subject: [Search-l] advanced search In-Reply-To: <471965e10807220606y6955665fybfd347bf3b988228@mail.gmail.com> References: <471965e10806300026u1b256248i8c8b0eb2eb31b863@mail.gmail.com> <471965e10807200839v5869c3dci27bc5314b62eae27@mail.gmail.com> <6b6419750807201507g2052b634sbec7d18d28005d27@mail.gmail.com> <5f2640d0807202025m6bc70661i9c0f00b113cd247@mail.gmail.com> <4884B83F.1070502@wikia.com> <471965e10807220606y6955665fybfd347bf3b988228@mail.gmail.com> Message-ID: i agree that first we need to be able to handle the indexing of non-text content such as pictures etc before moving onto specific features for this content which would be useless without it (like building the roof of the house without any walls) is this being considered by the wikia team at all? or do they have any thoughts as to how to progress with this at all. 
IMO we should consider grub for this in the same way as we already do with text regards mark On Tue, Jul 22, 2008 at 2:06 PM, G?rard Dupont wrote: > I'm not sure that face recognition is the good feature to add for now. The > first step is to add indexation by content for media (image and more) and > then using some tricks, face recognition (and more generally object > recognition) could be used. > > There is plenty ways to implement image indexing (lot of feature extraction > components exists and Lucene adaptation to image indexing too). > > 2008/7/21 Jimmy Wales : > > Marc . wrote: >> > I personally think that image search could be a major difference between >> > Wikia Search and other big Search Engine (like google, yahoo etc) >> > The big engine are really "bad" when proposing an image search. >> > I think that, for a "simple" start, someone looking for a image of >> > someone else in the net could upload an image with a face and the >> > SE look in the net for other similar images (yes there is tech for it). >> > Get a look in http://www.face-rec.org/ they have a lot of resources >> that >> > could be applied for it (and no, i don't have any relations with >> > face-reg.org ). >> >> I am on a plane and can't look right now, but is their work free software? >> >> _______________________________________________ >> Wikia Search mailing list >> http://re.search.wikia.com/ >> Change options or unsubscribe: >> http://lists.wikia.com/mailman/options/search-l >> > > > > -- > G?rard Dupont > Information Processing Competence Center (IPCC) - EADS DS > http://weblab-project.org > > Perception & Machine Learning team - LITIS Laboratory > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.wikia.com/pipermail/search-l/attachments/20080722/b502da46/attachment.html From ger.dupont at gmail.com Tue Jul 22 14:02:18 2008 From: ger.dupont at gmail.com (Gérard Dupont) Date: Tue, 22 Jul 2008 16:02:18 +0200 Subject: [Search-l] Widgets In-Reply-To: References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> <4884B4FB.4030508@wikia.com> Message-ID: <471965e10807220702x7d2d7f3ch8be4a03e3c21ab9a@mail.gmail.com> I was not talking about using Yahoo's or Google's suggestion feature, but about how they do it. Actually, I'm not aware of any open (or closed) service that provides query/term suggestion. As for the use of user data, I don't see anything "wrong" in query/term suggestion as long as users are aware that their search history could be used to enhance the engine (which is the aim of Wikia Search). We could do it in an open way, for instance using APML profiles for users and giving them control over them. 2008/7/22 Mark (Markie) : > hmm this is the problem i kinda outlined above, as to what amount do we > accept the use of unfree services that may use info from our users in the > "wrong" way, such as google, yahoo etc. i agree that we should try at all > costs to stay free, but to what limits will we accept the use of apis etc > from proprietaries? > > regards > > mark > > > On Mon, Jul 21, 2008 at 5:10 PM, Jimmy Wales wrote: > >> Aerik Sylvan wrote: >> > Hmm... couldn't you use a Google Maps API call (JS) to get the lat lang, >> > then call the NOAA feed? I haven't looked at the NOAA feed at all, >> > but... Then we cache the lat/lang in a cookie or something, to avoid a >> > gazillion calls. Might be a little tricky with pure javascript, if the >> > NOAA feed is only xml and not JSON, but... >> >> I am 100% all for this, and I am also a free software and free data >> fanatic, and would love to see a way to make all of this available under >> a free license.
>> >> So calling external APIs to proprietary servcies to make life better is >> great -- we are doing this now, calling Yahoo BOSS when we have no >> results -- but it is only great as long as we keep firmly focussed on: >> how can we also replace this with something really open and free. >> >> _______________________________________________ >> Wikia Search mailing list >> http://re.search.wikia.com/ >> Change options or unsubscribe: >> http://lists.wikia.com/mailman/options/search-l >> > > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -- G?rard Dupont Information Processing Competence Center (IPCC) - EADS DS http://weblab-project.org Perception & Machine Learning team - LITIS Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080722/0375d605/attachment.html From jwales at wikia.com Tue Jul 22 18:27:33 2008 From: jwales at wikia.com (Jimmy Wales) Date: Tue, 22 Jul 2008 14:27:33 -0400 Subject: [Search-l] Widgets In-Reply-To: <471965e10807220559h7b579e63o852c1de8f92546f7@mail.gmail.com> References: <3ae3aa420807181405g353886bek21f94904a089b1cc@mail.gmail.com> <4884B524.80208@wikia.com> <471965e10807220559h7b579e63o852c1de8f92546f7@mail.gmail.com> Message-ID: <48862695.4030700@wikia.com> G?rard Dupont wrote: > You could simply use dictionnaries. *nod* > Better is to use semantic sources to find close terms in the semantic > space (quite harder I think) Yes, but with an open framework like I am envisioning, one way to handle this would be for independent people (anyone who feels like it) to generate clever dictionaries using semantic fancyness... and to make their algorithms free for others to play with. 
This would lead to a lot of innovation, and if we do the API well, then it should be simple for any half-decent coder to whip up something simple and clever. From jwales at wikia.com Tue Jul 22 20:27:07 2008 From: jwales at wikia.com (Jimmy Wales) Date: Tue, 22 Jul 2008 16:27:07 -0400 Subject: [Search-l] Widgets In-Reply-To: References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> <4884B4FB.4030508@wikia.com> Message-ID: <4886429B.3010107@wikia.com> Mark (Markie) wrote: > hmm this is the problem i kinda outlined above, as to what amount do we > accept the use of unfree services that may use info from our users in > the "wrong" way, such as google, yahoo etc. i agree that we should try > at all costs to stay free, but to what limits will we accept the use of > apis etc from proprietaries? I think that's an excellent excellent question. I don't know the answer. We link to lots of proprietary things, of course. Nothing wrong with that, we are a search engine. And some things are not likely to be free-as-in-speech for a long time, if ever. Realtime weather data could be, in theory, at least for the US, not sure about other places. But the government publishes a ton of weather information in pretty unusable formats... we could sort it out and have a nice system... maybe. :) But movie showtimes in my area? Hard for that to be free. --Jimbo From jwales at wikia.com Tue Jul 22 20:36:08 2008 From: jwales at wikia.com (Jimmy Wales) Date: Tue, 22 Jul 2008 16:36:08 -0400 Subject: [Search-l] Widgets In-Reply-To: References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> Message-ID: <488644B8.2010307@wikia.com> Mark (Markie) wrote: > hmmm we could do but then we would be using more google resources (evil) > but would then get free results (good), so we have to balance that up. > i know there was some open source kit somewhere, that did geolocation by > ip and JS calls etc, but i cant remember where/what it was. 
will have > to see if i can dig it up again Just to be clear, I do not think that google is evil. Of course I think it is better when we think really really hard about how to create free (in the sense of GNU) alternatives to currently proprietary solutions, but I don't think we should be paranoid about google or any other company... at least not to the point that it hinders our work. :) --Jimbo From newsmarkie at googlemail.com Tue Jul 22 20:44:40 2008 From: newsmarkie at googlemail.com (Mark (Markie)) Date: Tue, 22 Jul 2008 21:44:40 +0100 Subject: [Search-l] Widgets In-Reply-To: <488644B8.2010307@wikia.com> References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> <488644B8.2010307@wikia.com> Message-ID: heh, you should know not to take me seriously when i say stuff like that, i say "evil" in the loose way that it is not as good as "us", but really im still a google user and business is business so i dont really hold anything against them (apart from not being WS :-p) mark On Tue, Jul 22, 2008 at 9:36 PM, Jimmy Wales wrote: > Mark (Markie) wrote: > > hmmm we could do but then we would be using more google resources (evil) > > but would then get free results (good), so we have to balance that up. > > i know there was some open source kit somewhere, that did geolocation by > > ip and JS calls etc, but i cant remember where/what it was. will have > > to see if i can dig it up again > > Just to be clear, I do not think that google is evil. Of course I think > it is better when we think really really hard about how to create free > (in the sense of GNU) alternatives to currently proprietary solutions, > but I don't think we should be paranoid about google or any other > company... at least not to the point that it hinders our work. 
:) > > --Jimbo > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080722/486205a4/attachment.html From jwales at wikia.com Tue Jul 22 20:45:48 2008 From: jwales at wikia.com (Jimmy Wales) Date: Tue, 22 Jul 2008 16:45:48 -0400 Subject: [Search-l] Widgets In-Reply-To: References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> <488644B8.2010307@wikia.com> Message-ID: <488646FC.3050809@wikia.com> :-) Yeah! :) Mark (Markie) wrote: > heh, you should know not to take me seriously when i say stuff like > that, i say "evil" in the loose way that it is not as good as "us", but > really im still a google user and business is business so i dont really > hold anything against them (apart from not being WS :-p) > > mark > > On Tue, Jul 22, 2008 at 9:36 PM, Jimmy Wales > wrote: > > Mark (Markie) wrote: > > hmmm we could do but then we would be using more google resources > (evil) > > but would then get free results (good), so we have to balance > that up. > > i know there was some open source kit somewhere, that did > geolocation by > > ip and JS calls etc, but i cant remember where/what it was. will > have > > to see if i can dig it up again > > Just to be clear, I do not think that google is evil. Of course I think > it is better when we think really really hard about how to create free > (in the sense of GNU) alternatives to currently proprietary solutions, > but I don't think we should be paranoid about google or any other > company... at least not to the point that it hinders our work. 
:) > > --Jimbo > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l From SvenDowideit at home.org.au Tue Jul 22 23:46:25 2008 From: SvenDowideit at home.org.au (Sven Dowideit) Date: Wed, 23 Jul 2008 09:46:25 +1000 Subject: [Search-l] Widgets In-Reply-To: References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> <4884B4FB.4030508@wikia.com> Message-ID: <48867151.3090705@home.org.au> A quick search for open geo-data reveals services like http://www.geonames.org/about.html CC attribution licensed, and they bother to cite sources - I've not looked at where Google gets its data, but I'm always suspicious of un-cited commercial 'datasets'. Sven Mark (Markie) wrote: > hmm this is the problem i kinda outlined above, as to what amount do we > accept the use of unfree services that may use info from our users in the > "wrong" way, such as google, yahoo etc. i agree that we should try at all > costs to stay free, but to what limits will we accept the use of apis etc > from proprietaries? > > regards > > mark > > On Mon, Jul 21, 2008 at 5:10 PM, Jimmy Wales wrote: > >> Aerik Sylvan wrote: >>> Hmm... couldn't you use a Google Maps API call (JS) to get the lat lang, >>> then call the NOAA feed? I haven't looked at the NOAA feed at all, >>> but... Then we cache the lat/lang in a cookie or something, to avoid a >>> gazillion calls. Might be a little tricky with pure javascript, if the >>> NOAA feed is only xml and not JSON, but... 
>> I am 100% all for this, and I am also a free software and free data >> fanatic, and would love to see a way to make all of this available under >> a free license. >> >> So calling external APIs to proprietary servcies to make life better is >> great -- we are doing this now, calling Yahoo BOSS when we have no >> results -- but it is only great as long as we keep firmly focussed on: >> how can we also replace this with something really open and free. >> >> _______________________________________________ >> Wikia Search mailing list >> http://re.search.wikia.com/ >> Change options or unsubscribe: >> http://lists.wikia.com/mailman/options/search-l >> > > > ------------------------------------------------------------------------ > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l -- Professional Wiki Innovation and Support Sven Dowideit - http://DistributedINFORMATION.com A WikiRing Partner - http://wikiring.com Public key - http://pgp.mit.edu:11371/pks/lookup?search=Sven+Dowideit&op=index&exact=on From newsmarkie at googlemail.com Wed Jul 23 13:39:54 2008 From: newsmarkie at googlemail.com (Mark (Markie)) Date: Wed, 23 Jul 2008 14:39:54 +0100 Subject: [Search-l] Widgets In-Reply-To: <48867151.3090705@home.org.au> References: <355a36af0807182131l34ec2b62id534f6f56935d987@mail.gmail.com> <4884B4FB.4030508@wikia.com> <48867151.3090705@home.org.au> Message-ID: hmm that looks useful :-) mark On Wed, Jul 23, 2008 at 12:46 AM, Sven Dowideit wrote: > A quick search for open geo-data reveals services like > http://www.geonames.org/about.html > > CC attribution licensed, and they bother to cite sources - I've not > looked at where Google gets its data, but I'm always suspicious of > un-cited commercial 'datasets'. 
> > Sven > > Mark (Markie) wrote: > > hmm this is the problem i kinda outlined above, as to what amount do we > > accept the use of unfree services that may use info from our users in the > > "wrong" way, such as google, yahoo etc. i agree that we should try at > all > > costs to stay free, but to what limits will we accept the use of apis etc > > from proprietaries? > > > > regards > > > > mark > > > > On Mon, Jul 21, 2008 at 5:10 PM, Jimmy Wales wrote: > > > >> Aerik Sylvan wrote: > >>> Hmm... couldn't you use a Google Maps API call (JS) to get the lat > lang, > >>> then call the NOAA feed? I haven't looked at the NOAA feed at all, > >>> but... Then we cache the lat/lang in a cookie or something, to avoid a > >>> gazillion calls. Might be a little tricky with pure javascript, if the > >>> NOAA feed is only xml and not JSON, but... > >> I am 100% all for this, and I am also a free software and free data > >> fanatic, and would love to see a way to make all of this available under > >> a free license. > >> > >> So calling external APIs to proprietary servcies to make life better is > >> great -- we are doing this now, calling Yahoo BOSS when we have no > >> results -- but it is only great as long as we keep firmly focussed on: > >> how can we also replace this with something really open and free. 
> >> > >> _______________________________________________ > >> Wikia Search mailing list > >> http://re.search.wikia.com/ > >> Change options or unsubscribe: > >> http://lists.wikia.com/mailman/options/search-l > >> > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Wikia Search mailing list > > http://re.search.wikia.com/ > > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > > -- > Professional Wiki Innovation and Support > Sven Dowideit - http://DistributedINFORMATION.com > A WikiRing Partner - http://wikiring.com > Public key - > http://pgp.mit.edu:11371/pks/lookup?search=Sven+Dowideit&op=index&exact=on > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080723/85f9616a/attachment.html From dan at wikia-inc.com Wed Jul 23 18:33:56 2008 From: dan at wikia-inc.com (Dan Lewis) Date: Wed, 23 Jul 2008 14:33:56 -0400 Subject: [Search-l] Search Team Update: July 23, 2008 Message-ID: <6704a5e60807231133q5193ecaagbe019d0eb8a55997@mail.gmail.com> Here's what the Search team did last week: Nutch * Finish FieldIndexer * Finished BasicFields * Working on AnchorFields These are all part of the new Indexer that will allow fine grained control of fields that go into our index. The FieldIndexer is the actual indexer itself that replaces the current Nutch indexer. The BasicFields replaces the current nutch functionality for fields from the indexer and BasicIndexingFilter plugin. The AnchorFields both replaces the current AnchorIndexingPlugin and enhances it to allow analysis and ordering by score of anchors to be indexed. 
The AnchorFields job should be finished, tested, and ready for larger deployment early this coming week. Search Tools: * Worked on a new, experimental "light" fork to the results UI * Lots of work testing new KT tools in development * Brainstorming about the widget framework and how to speed up results * Lots of work with the crawler, trying to find the source of very high fetch failure rates * Continued development of the toolbar Community Tools: * Began work on a contact importer * Created interface for translating Wikia Search interface Operations: * Deploy-redeploy of KT /ktdev/, started review of code * Started work on determining new hardware requirements -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080723/fc70c8ab/attachment.html From dirson at gmail.com Wed Jul 23 18:41:17 2008 From: dirson at gmail.com (Dirson) Date: Wed, 23 Jul 2008 20:41:17 +0200 Subject: [Search-l] Google Knol launched Message-ID: http://knol.google.com You can earn $$$ by writing knols. From kubes at apache.org Fri Jul 25 19:08:28 2008 From: kubes at apache.org (Dennis Kubes) Date: Fri, 25 Jul 2008 14:08:28 -0500 Subject: [Search-l] Interesting Article Message-ID: <488A24AC.4020900@apache.org> http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html From jwales at wikia.com Sun Jul 27 15:03:34 2008 From: jwales at wikia.com (Jimmy Wales) Date: Sun, 27 Jul 2008 10:03:34 -0500 Subject: [Search-l] Interesting Article In-Reply-To: <488A24AC.4020900@apache.org> References: <488A24AC.4020900@apache.org> Message-ID: <488C8E46.3050508@wikia.com> Dennis Kubes wrote: > http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html Weird. 1 trillion unique web pages. I am skeptical. That's 166 pages per person on earth. Or, if we assume there are 1 billion people online, that's 1,000 pages for every person online. I don't know about you, but I haven't written 1,000 web pages yet.
If they are data-driven pages, that's interesting and all, but "counting" pages from a data-driven site is a bit silly. Even the blog post acknowledges this, by talking about how a calendar site has, theoretically, an infinite number of pages. --Jimbo From chrisdesouza at yahoo.com Sun Jul 27 15:34:20 2008 From: chrisdesouza at yahoo.com (Chris Desouza) Date: Sun, 27 Jul 2008 08:34:20 -0700 (PDT) Subject: [Search-l] Interesting Article In-Reply-To: <488C8E46.3050508@wikia.com> Message-ID: <385688.50757.qm@web54107.mail.re2.yahoo.com> You forgot "porn", Jimmy boy... http://www.focusonit.info/?p=30796 --- On Sun, 7/27/08, Jimmy Wales wrote: From: Jimmy Wales Subject: Re: [Search-l] Interesting Article To: "Mailing list for Search Wikia" Date: Sunday, July 27, 2008, 8:33 PM Dennis Kubes wrote: > http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html Weird. 1 trillion unique web pages. I am skeptical. That's 166 pages per person on earth. Or, if we assume there are 1 billion people online, that's 1,000 pages for every person online. I don't know about you, but I haven't written 1,000 web pages yet. If they are data-driven pages, that's interesting and all, but "counting" pages from a data-driven site is a bit silly. Even the blog post acknowledges this, by talking about how a calendar site has, theoretically, an infinite number of pages. --Jimbo _______________________________________________ Wikia Search mailing list http://re.search.wikia.com/ Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.wikia.com/pipermail/search-l/attachments/20080727/2fe02e77/attachment.html From kubes at apache.org Sun Jul 27 15:56:20 2008 From: kubes at apache.org (Dennis Kubes) Date: Sun, 27 Jul 2008 10:56:20 -0500 Subject: [Search-l] Interesting Article In-Reply-To: <488C8E46.3050508@wikia.com> References: <488A24AC.4020900@apache.org> <488C8E46.3050508@wikia.com> Message-ID: <488C9AA4.90900@apache.org> I found it interesting in that we recently passed 1 billion urls that we know about. Dennis Jimmy Wales wrote: > Dennis Kubes wrote: >> http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html > > Weird. > > 1 trillion unique web pages. I am skeptical. That's 166 pages per > person on earth. Or, if we assume there are 1 billion people online, > that's 1,000 pages for every person online. I don't know about you, but > I haven't written 1,000 web pages yet. > > If they are data-driven pages, that's interesting and all, but > "counting" pages from a data-driven site is a bit silly. Even the blog > post acknowledges this, by talking about how a calendar site has, > theoretically, an infinite number of pages. > > --Jimbo > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: http://lists.wikia.com/mailman/options/search-l From linasvepstas at gmail.com Sun Jul 27 16:09:34 2008 From: linasvepstas at gmail.com (Linas Vepstas) Date: Sun, 27 Jul 2008 11:09:34 -0500 Subject: [Search-l] Interesting Article In-Reply-To: <385688.50757.qm@web54107.mail.re2.yahoo.com> References: <488C8E46.3050508@wikia.com> <385688.50757.qm@web54107.mail.re2.yahoo.com> Message-ID: <3ae3aa420807270909r19c4831eve59880a50d6dfa09@mail.gmail.com> 2008/7/27 Chris Desouza : > You forgot "porn", Jimmy boy... > http://www.focusonit.info/?p=30796 Hmm, There aren't 1000 photos of naked me online yet. 
> --- On Sun, 7/27/08, Jimmy Wales wrote: > Dennis Kubes wrote: >> http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html > > Weird. > > 1 trillion unique web pages. I am skeptical. That's 166 pages per > person on earth. Or, if we assume there are 1 billion people online, > that's 1,000 pages for every person online. I don't know about you, > but > I haven't written > 1,000 web pages yet. But google does scan mailing list archives, and you might have written 1000 emails by now ... --linas From michael at md-d.org Sun Jul 27 16:51:16 2008 From: michael at md-d.org (Michael Diederich) Date: Sun, 27 Jul 2008 18:51:16 +0200 Subject: [Search-l] Interesting Article In-Reply-To: <488C8E46.3050508@wikia.com> References: <488A24AC.4020900@apache.org> <488C8E46.3050508@wikia.com> Message-ID: Hi, On Sun, Jul 27, 2008 at 5:03 PM, Jimmy Wales wrote: > 1 trillion unique web pages. I am skeptical. That's 166 pages per > person on earth. Or, if we assume there are 1 billion people online, > that's 1,000 pages for every person online. I don't know about you, but > I haven't written 1,000 web pages yet. Within my company's content management system, we have all market data about light commercial vehicles - each spider gets about 700.000 pages, with best compression the result is 2.3GB. For myself, I wrote more than 50.000 usenet postings, each of them is indexed by google. I am quite sure, you wrote more than 1000 pages ;) > If they are data-driven pages, that's interesting and all, but > "counting" pages from a data-driven site is a bit silly. Even the blog > post acknowledges this, by talking about how a calendar site has, > theoretically, an infinite number of pages. They count unique content. An empty page just counts one. 
Kind regards, Michael 'da didi' Diederich -- Student: Master of Computer Science University Duisburg-Essen http://de.wikipedia.org/wiki/Benutzer:MichaelDiederich http://www.md-d.org/ From jwales at wikia.com Mon Jul 28 00:59:18 2008 From: jwales at wikia.com (Jimmy Wales) Date: Sun, 27 Jul 2008 19:59:18 -0500 Subject: [Search-l] Interesting Article In-Reply-To: <3ae3aa420807270909r19c4831eve59880a50d6dfa09@mail.gmail.com> References: <488C8E46.3050508@wikia.com> <385688.50757.qm@web54107.mail.re2.yahoo.com> <3ae3aa420807270909r19c4831eve59880a50d6dfa09@mail.gmail.com> Message-ID: <488D19E6.30304@wikia.com> Linas Vepstas wrote: > But google does scan mailing list archives, and you might have written > 1000 emails by now ... True dat. From dan at wikia-inc.com Mon Jul 28 14:44:00 2008 From: dan at wikia-inc.com (Dan Lewis) Date: Mon, 28 Jul 2008 10:44:00 -0400 Subject: [Search-l] Interesting Article In-Reply-To: <488C8E46.3050508@wikia.com> References: <488A24AC.4020900@apache.org> <488C8E46.3050508@wikia.com> Message-ID: <6704a5e60807280744l465e7a99y35eb77693156df59@mail.gmail.com> Assuming it is right -- the 1 trillion pages, I mean -- I find it amazing that Google only indexes, say, 25% of it. What about the other 75%? It can't be all duplicate content and machine generated pages. Not even close. I figured I'd toss something on the blog about this, and ended up noticing that Google does a pretty bad job of adding blog posts to the main search engine. If you run the domains of blog hosting companies through their blog search, they tell you they see, e.g., almost 5 billion blogspot.com URLs. Run the same domain through the main search engine? 340 million. It's likely that the numbers they provide are inaccurate, but the difference here is 7 billion (!) results when you include livejournal.com and wordpress.com. 
http://search.wikia.com/blog/2008/07/28/whats-the-other-75-percent-blogs/ Dan On Sun, Jul 27, 2008 at 11:03 AM, Jimmy Wales wrote: > Dennis Kubes wrote: > > http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html > > Weird. > > 1 trillion unique web pages. I am skeptical. That's 166 pages per > person on earth. Or, if we assume there are 1 billion people online, > that's 1,000 pages for every person online. I don't know about you, but > I haven't written 1,000 web pages yet. > > If they are data-driven pages, that's interesting and all, but > "counting" pages from a data-driven site is a bit silly. Even the blog > post acknowledges this, by talking about how a calendar site has, > theoretically, an infinite number of pages. > > --Jimbo > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080728/38ef38c5/attachment.html From linasvepstas at gmail.com Mon Jul 28 18:56:09 2008 From: linasvepstas at gmail.com (Linas Vepstas) Date: Mon, 28 Jul 2008 13:56:09 -0500 Subject: [Search-l] Related concepts, function words, content words Message-ID: <3ae3aa420807281156o7aff64d9nd61ac1e4e6b81c92@mail.gmail.com> I just noticed something curious about google's "related topics" function. I'd been reading gmail using the web browser, and there's always a list of ads that seem to be keyed off of keywords in the email. Today, none of the ads were keyed off of keywords ... instead, they were keyed off of broad sentiment. The actual email was from my rowing coach, bitching about how people failed to show up for practice, and how that makes everyone late on the water and changes the planned workout, etc. 
The ads were all about employee-employer relations -- how to fire employees, how to file workplace grievances, negotiating with unions, etc. Now, nowhere in the email did it use the words "employee", "union", "grievance", "fire", "discharge" -- but somehow google perceived the overall negative tone, and that it had to do with personal relationships. Its mistake was to assume it's job-related. And yet -- none of the ads were for marriage counseling or spousal abuse -- so it could tell that this was a more formal setting -- it did not mistake it for a lover showing up late for a romantic dinner (no "buy her flowers" ads), or missing out on camping with your buddies (no "how to make friends" ads). Particularly of note is that it missed the obvious sports nature of the email: content words like "rowing", "water", "workout", "practice" and "boat" were in the email, and should have given a strong positive signal ... and yet these were overlooked, in favour of the much more vague non-content, functional phrases like "failing to show up". --linas From aerik at thesylvans.com Tue Jul 29 04:50:29 2008 From: aerik at thesylvans.com (Aerik Sylvan) Date: Mon, 28 Jul 2008 21:50:29 -0700 Subject: [Search-l] Interesting Article In-Reply-To: <488C9AA4.90900@apache.org> References: <488A24AC.4020900@apache.org> <488C8E46.3050508@wikia.com> <488C9AA4.90900@apache.org> Message-ID: <355a36af0807282150s906a7fby6686e80992e1173e@mail.gmail.com> This was an interesting piece I saw (from the Mercury News): *The latest search engine* with ambitions to distinguish itself from Google and win a bit of mindshare is called Cuil (pronounced "cool," and BTW, can any of you marketing folks explain the wisdom of choosing a company name that needs a pronouncer?). Cuil, started by some ex-Googlers, emerged from stealth mode this morning, and at the moment, its only ambition is to survive a first day in which it was overwhelmed by both traffic and lukewarm-to-bad reviews.
Among Cuil's selling points are an index of 120 billion Web pages -- three times the size of Google's, it claims -- and relevance that it says is based on the contents of a page, not its popularity. The size of the index is interesting, as well as "relevance ... based on the contents, not its popularity" ... wow, what a concept. Pagerank was brilliant and all, but (my soapbox again) I do not always want the most popular (entrenched) results. Best, Aerik On Sun, Jul 27, 2008 at 8:56 AM, Dennis Kubes wrote: > I found it interesting in that we recently passed 1 billion urls that we > know about. > > Dennis > > Jimmy Wales wrote: > > Dennis Kubes wrote: > >> http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html > > > > Weird. > > > > 1 trillion unique web pages. I am skeptical. That's 166 pages per > > person on earth. Or, if we assume there are 1 billion people online, > > that's 1,000 pages for every person online. I don't know about you, but > > I haven't written 1,000 web pages yet. > > > > If they are data-driven pages, that's interesting and all, but > > "counting" pages from a data-driven site is a bit silly. Even the blog > > post acknowledges this, by talking about how a calendar site has, > > theoretically, an infinite number of pages. > > > > --Jimbo > > > > _______________________________________________ > > Wikia Search mailing list > > http://re.search.wikia.com/ > > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -- http://www.wikidweb.com - the Wiki Directory of the Web http://tagthis.info - Hosted Tagging for your website! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.wikia.com/pipermail/search-l/attachments/20080728/29e4d50c/attachment.html From ger.dupont at gmail.com Tue Jul 29 07:12:24 2008 From: ger.dupont at gmail.com (Gérard Dupont) Date: Tue, 29 Jul 2008 09:12:24 +0200 Subject: [Search-l] Interesting Article In-Reply-To: <355a36af0807282150s906a7fby6686e80992e1173e@mail.gmail.com> References: <488A24AC.4020900@apache.org> <488C8E46.3050508@wikia.com> <488C9AA4.90900@apache.org> <355a36af0807282150s906a7fby6686e80992e1173e@mail.gmail.com> Message-ID: <471965e10807290012o1c4b7a0fq47fbba2e07b62055@mail.gmail.com> I don't see the point of "relevance ... based on the contents, not its popularity"... That's what most systems do. The most basic search engine builds a keyword index, which is based on content. The point of page rank is that it is much broader, because it needs an analysis of the links between lots of pages and has to maintain the graph of links. Rather than the claimed big index or the "relevance based on content", I prefer to talk about some minor features (much less interesting for marketing, of course). The new result layout, which looks like a journal, is appealing to me. At least it is a change from the classic ranked list. It contains extended snippets (or it looks like the snippets are bigger... whatever) and they always include a small picture. IMHO that's relevant, but does anyone know where those pictures come from (they are neither a thumbnail of the result nor a picture from the result...)? The category tool is also something good (but not new), which comes from enterprise search tools, where the category of documents is much more important. They also have an interactive query suggestion tool, which is quite relevant for a one-day-old engine. Classically, query suggestion is based on past queries... Either they already have existing data (or the one-day-old data), or they use some other approach.
Finally, I have not tried the engine enough (nor run any benchmark) to claim that it is a good or a bad engine. But has anyone made such a study? gdupont 2008/7/29 Aerik Sylvan > This was an interesting piece I saw (from the Mercury News): > > *The latest search engine* with ambitions to distinguish itself from > Google and win a bit of mindshare is called Cuil (pronounced "cool," and BTW, can any of you marketing folks explain the > wisdom of choosing a company name that needs a pronouncer?). Cuil, started > by some ex-Googlers, emerged from stealth mode this morning, and at the > moment, its only ambition is to survive a first day in which it was > overwhelmed by both traffic and lukewarm-to-bad > reviews. > > Among Cuil's selling points are an > index of 120 billion Web pages -- three times the size of Google's, it > claims -- and relevance that it says is based on the contents of a page, not > its popularity. > > > The size of the index is interesting, as well as "relevance ... based on > the contents, not its popularity" ... wow, what a concept. Pagerank was > brilliant and all, but (my soapbox again) I do not always want the most > popular (entrenched) results. > > > Best, > > Aerik > > > On Sun, Jul 27, 2008 at 8:56 AM, Dennis Kubes wrote: > >> I found it interesting in that we recently passed 1 billion urls that we >> know about. >> >> Dennis >> >> Jimmy Wales wrote: >> > Dennis Kubes wrote: >> >> http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html >> > >> > Weird. >> > >> > 1 trillion unique web pages. I am skeptical. That's 166 pages per >> > person on earth. Or, if we assume there are 1 billion people online, >> > that's 1,000 pages for every person online. I don't know about you, but >> > I haven't written 1,000 web pages yet. >> > >> > If they are data-driven pages, that's interesting and all, but >> > "counting" pages from a data-driven site is a bit silly.
Even the blog >> > post acknowledges this, by talking about how a calendar site has, >> > theoretically, an infinite number of pages. >> > >> > --Jimbo >> > >> > _______________________________________________ >> > Wikia Search mailing list >> > http://re.search.wikia.com/ >> > Change options or unsubscribe: >> http://lists.wikia.com/mailman/options/search-l >> _______________________________________________ >> Wikia Search mailing list >> http://re.search.wikia.com/ >> Change options or unsubscribe: >> http://lists.wikia.com/mailman/options/search-l >> > > > > -- > http://www.wikidweb.com - the Wiki Directory of the Web > http://tagthis.info - Hosted Tagging for your website! > > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -- Gérard Dupont Information Processing Competence Center (IPCC) - EADS DS http://weblab-project.org Perception & Machine Learning team - LITIS Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.wikia.com/pipermail/search-l/attachments/20080729/068e8238/attachment.html From linasvepstas at gmail.com Tue Jul 29 14:21:36 2008 From: linasvepstas at gmail.com (Linas Vepstas) Date: Tue, 29 Jul 2008 09:21:36 -0500 Subject: [Search-l] Interesting Article In-Reply-To: <471965e10807290012o1c4b7a0fq47fbba2e07b62055@mail.gmail.com> References: <488A24AC.4020900@apache.org> <488C8E46.3050508@wikia.com> <488C9AA4.90900@apache.org> <355a36af0807282150s906a7fby6686e80992e1173e@mail.gmail.com> <471965e10807290012o1c4b7a0fq47fbba2e07b62055@mail.gmail.com> Message-ID: <3ae3aa420807290721x1cf7cea2mefc0b026d9ba4118@mail.gmail.com> 2008/7/29 Gérard Dupont : > I don't see the point "relevance ... based on the contents, not its > popularity"... That's what most systems do. It's certainly not what page rank does. Page rank is a pure popularity contest.
The success of Page rank depends on the hope that humans have previously identified the "good pages" and made them popular. This is also why page-rank can be fooled: use bots and bogus web sites to create artificial "popular" pages, and page-rank will rank them highly, even if the content sucks. Finding some mechanism that outperforms pagerank is no mean feat. > and they always include a small picture. IMHO that's relevant but do anyone > know where are those pictures from (that not thumbnail of the result neither > a picture from the result...). I dunno. I ego-surfed for my own name, and found pictures of ugly people who I have never seen before. Yechh. How people I've never seen before are associated with projects I've worked on, I don't know. --linas From ger.dupont at gmail.com Tue Jul 29 15:06:54 2008 From: ger.dupont at gmail.com (Gérard Dupont) Date: Tue, 29 Jul 2008 17:06:54 +0200 Subject: [Search-l] Interesting Article In-Reply-To: <3ae3aa420807290721x1cf7cea2mefc0b026d9ba4118@mail.gmail.com> References: <488A24AC.4020900@apache.org> <488C8E46.3050508@wikia.com> <488C9AA4.90900@apache.org> <355a36af0807282150s906a7fby6686e80992e1173e@mail.gmail.com> <471965e10807290012o1c4b7a0fq47fbba2e07b62055@mail.gmail.com> <3ae3aa420807290721x1cf7cea2mefc0b026d9ba4118@mail.gmail.com> Message-ID: <471965e10807290806y17af2b37ya9fdd139b7802cbc@mail.gmail.com> 2008/7/29 Linas Vepstas > 2008/7/29 Gérard Dupont : > > I don't see the point "relevance ... based on the contents, not its > > popularity"... That's what most systems do. > > It's certainly not what page rank does. Page rank is a > pure popularity contest. The success of Page rank > depends on the hope that humans have previously > identified the "good pages" and made them popular. > This is also why page-rank can be fooled: use bots > and bogus web sites to create artificial "popular" pages, > and page-rank will rank them highly, even if the content sucks.
Actually, the ranking model of google goes much deeper than page rank, which is only the visible part of the iceberg. For most searches google provides quite good results, still open to criticism, but still better than most engines. They do not only use links to compute page rank; authority sites are also defined in a way that tries to avoid the whole bots thing. Still open to criticism, but not so bad. > Finding some mechanism that outperforms pagerank > is no mean feat. > > > and they always include a small picture. IMHO that's relevant but do > anyone > > know where are those pictures from (that not thumbnail of the result > neither > > a picture from the result...). > > I dunno. I ego-surfed for my own name, and found pictures > of ugly people who I have never seen before. Yechh. How > people I've never seen before are associated with projects > I've worked on, I don't know. Exactly what I found too, but I also found that the pictures let you quickly identify non-relevant pages, and sometimes the relevant ones too. I really want to know how they decide whether to present such an image or not... > > > --linas > _______________________________________________ > Wikia Search mailing list > http://re.search.wikia.com/ > Change options or unsubscribe: > http://lists.wikia.com/mailman/options/search-l > -- Gérard Dupont Information Processing Competence Center (IPCC) - EADS DS http://weblab-project.org Perception & Machine Learning team - LITIS Laboratory -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20080729/a05e8f31/attachment.html

From dan at wikia-inc.com Tue Jul 29 15:11:49 2008
From: dan at wikia-inc.com (Dan Lewis)
Date: Tue, 29 Jul 2008 11:11:49 -0400
Subject: [Search-l] Knol Proves the Importance of Transparency
Message-ID: <6704a5e60807290811n30c3be1btafcc7489928e9bdb@mail.gmail.com>

That's the topic of today's blog post :) I'm pasting it below -- but it's originally at http://search.wikia.com/blog/2008/07/29/knol-proves-the-importance-of-transparency/

D

***

From day one of the Wikia Search project, the Wikia Search community collectively brainstormed the core principles of the project and, indeed, the ones that search currently lacks and needs. One of them -- transparency -- is needed now more than ever.

Last week, Google released its new content endeavor, Knol -- a platform by which anyone can write up a page about a topic, invite others to help, and make some pocket change using Google's AdWords platform. Google, of course, also makes some coin off those AdWords ads, and with the long tail working to their benefit, can cash in big time.

The only trick? How to get traffic to all these new Knol pages. Well, they happen to have a pretty big search engine -- and already, some people are noticing that Knol entries tend to do well in Google Search results.

Jason Calacanis is probably the perfect person to point out the flaw -- he's no fan of ours here at Wikia Search (snif!) and admits that he's a "Google man" who "love[s] the Google", but even he is concerned: Yesterday, his screen shot of "how to backpack" made its way around the web, replete with an ominous tag line, showing that in just five days, a Knol made it to the top of the relevant search results.

But let's face it, Google is not going to re-write their algorithms to favor Knol. It'd be mindbogglingly idiotic to do, and more importantly, unnecessary. Why?
Because Knol already has an advantage that you, I, and the rest of the non-Google world don't have -- access to Google's search team, and to that algorithm itself.

For most people -- us average beings -- Google recommends that you work with a Search Engine Optimization specialist ("SEO"). No, not explicitly, but read that page and you'll see that (a) they don't directly answer the question as to whether one should hire an SEO and (b) very little is on point on that page in general. The small part that is says this:

    A great time to hire [an SEO] is when you're considering a site redesign, or planning to launch a new site. That way, you and your SEO can ensure that your site is designed to be search engine-friendly from the bottom up. However, a good SEO can also help improve an existing site.

But the fact is that SEOs do not know, exactly, how Google's algorithm works. Only one company does: Google itself. And at some point -- if it has not happened already -- someone from the Google Search team and someone from the Google Knol team will get together and give Knol a big lesson in SEO.

Maybe it will be an explicit, high-level decision. Maybe it will just be two people, one from each team, sitting down for lunch with the Search guy saying "hey, if you want to give your stuff a boost, do ." Maybe it happened six months ago. Maybe it will happen in three years. Who knows?

All we know is two things:

1. It's only possible because Google hides their algorithm from the non-Google world. If everyone could do it, Knol would have no appreciable advantage.

2. It's inevitable. Even if Google's corporate powers-that-be mandate that the two groups not mix nor mingle, the knowledge that flows through those halls will be impossible to shutter.

The solution? Open source that algorithm, and everyone -- including Knol -- is on a level playing field.
All accusations of impropriety go away, and the inevitable occurrence of Team Knol benefiting from private lessons with Team Search is instantly moot.

Transparency. Search demands it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.wikia.com/pipermail/search-l/attachments/20080729/5d83d7df/attachment.html