[Search-l] index update and request for seeds

jer jeremie at jabber.org
Sat Apr 12 01:24:00 UTC 2008


The index was refreshed again this week and rolled out yesterday
(thanks Seth and Dennis!). It is a crawl of the top ~25M pages, drawn
from a source of over 800M URLs. Result quality has some minor
improvements, but more significantly this update enabled language
detection (see http://wiki.apache.org/nutch/LanguageIdentifierPlugin).
The results page doesn't take advantage of this for any automatic
handling yet, but you can add a lang:?? filter (?? being a two-letter
language code) to any search. For the fun of it, I ran a search for
each of the ISO two-letter language codes and recorded the total
results for each; the list is attached below.

Seeds: the index is built and served with Nutch, and we're preparing a
compressed copy to put up for download via BitTorrent. It should be
about 290GB. If anyone has that much space and a decent connection and
would be willing to help seed this, just email me directly and I'll get
you going with a copy.

Also, besides lang:, here's a list of the other filters this index  
supports:
	url: (in the url)
	anchor: (in the incoming anchor text)
	content: (specifically just in the content)
	title: (title of the page)
	host: (just in the hostname)
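
As a rough illustration of how these filters combine with ordinary
search terms, here is a minimal Python sketch. It assumes each filter
is written as name:value and simply appended to the query after the
terms, as with lang:?? above. The build_query helper and the example
values are hypothetical and not part of Nutch or the search site.

```python
# Filters named in this message; anything else is rejected.
KNOWN_FILTERS = {"lang", "url", "anchor", "content", "title", "host"}


def build_query(terms, **filters):
    """Join plain search terms with name:value filters (assumed syntax)."""
    parts = [terms.strip()] if terms.strip() else []
    for name, value in filters.items():
        if name not in KNOWN_FILTERS:
            raise ValueError(f"unsupported filter: {name}")
        parts.append(f"{name}:{value}")
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical example: German-language pages on one host.
    print(build_query("open search", lang="de", host="apache.org"))
```

The resulting string is what you would type into the search box; the
order of the filters here just follows the order of the arguments.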

Jer

Lang	Results
AA	7
AB	46
AF	626
AM	21
AR	15265
AS	3
AY	0
AZ	212
BA	31
BE	283
BG	4868
BH	0
BI	0
BN	50
BO	4
BR	223
CA	14187
CO	7
CS	74257
CY	311
DA	149332
DE	1439097
DZ	1
EL	13071
EN	9602332
EO	1640
ES	611212
ET	1112
EU	917
FA	4239
FI	54938
FJ	0
FO	46
FR	791189
FY	9
GA	97
GD	111
GL	1053
GN	5
GU	35
HA	2
HE	10609
HI	896
HR	3591
HU	43842
HY	96
IA	13
IE	14
IK	1
ID	326
IS	45142
IT	595428
IU	0
JA	265533
JV	1
KA	43
KK	7
KL	1
KM	8
KN	27
KO	3070
KS	0
KU	58
KY	1
LA	184
LN	0
LO	6
LT	4633
LV	1519
MG	6
MI	234
MK	151
ML	10
MN	15
MO	1
MR	9
MS	72
MT	12
MY	8
NA	11
NE	116
NL	390734
NO	131595
OC	25
OM	4
OR	0
PA	12
PL	351849
PS	4
PT	180741
QU	1
RM	14
RN	0
RO	11080
RU	302048
RW	0
SA	63
SD	0
SG	9
SH	36
SI	153
SK	8127
SL	1486
SM	2
SN	0
SO	12
SQ	133
SR	1190
SS	1
ST	3
SU	0
SV	102920
SW	48
TA	91
TE	12
TG	1
TH	10203
TI	1
TK	3
TL	22
TN	0
TO	3
TR	32021
TS	1
TT	13
TW	1664
UG	0
UK	2394
UR	74
UZ	4
VI	341
VO	0
WO	0
XH	0
YI	7
YO	0
ZA	17
ZH	36847
ZU	37
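
If you want to play with the numbers above, a small Python sketch
like the following can read the list and rank the codes. It assumes
the list is saved as whitespace-separated "code count" lines; the
parse_counts and top_languages helpers are hypothetical names, and
the sample holds just four rows copied from the list.

```python
def parse_counts(text):
    """Parse 'CODE<whitespace>COUNT' lines into a dict, skipping other lines."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # header, blank, or malformed line
        counts[fields[0]] = int(fields[1])
    return counts


def top_languages(counts, n=3):
    """Return the n (code, count) pairs with the most results."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    # Four rows copied from the list above.
    sample = "EN\t9602332\nDE\t1439097\nFR\t791189\nAY\t0\n"
    counts = parse_counts(sample)
    for code, total in top_languages(counts):
        print(code, total)
    # Codes that returned no results at all.
    print([code for code, total in counts.items() if total == 0])
```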




