[Search-l] URL Normalization and Input
Dennis Kubes
kubes at apache.org
Fri Feb 1 00:59:22 UTC 2008
Balinny wrote:
> Dennis Kubes wrote:
>> We are currently working on URL normalization measures for the search
>> wikia crawls. URL normalization is used during crawls to change URLs
>> into standard forms. An example of this is having www.site.com/index.html
>> and www.site.com/ resolve to the same URL for crawling and scoring purposes.
>>
>> Eventually the idea would be to allow normalizations on a per domain
>> basis and allow the community to give detailed feedback per domain.
>> Currently all normalizations are on a global basis. Our current URL
>> normalizations are done through regex, so I have included the current
>> expressions as well. We have come up with the following
>> normalizations. Is there anything else we should include or change? What
>> does everyone think?
>>
> I hope it's only done for URL comparison purposes (to avoid duplicate
> results), not for crawling. The remove-index rule is especially dangerous.
> We should rely on Content-Location to detect that the index is the same as /
That is one of the reasons I wanted to get some feedback. Is it
dangerous because there are many non-default pages called index.html or
equivalent, or is it dangerous because of spam implications? The
intention was to use this during generation of URLs to fetch and when
parsing links from pages.
Nutch has a dedup process that eliminates both duplicate URLs and
duplicate content by hash. I am more concerned about duplicate pages
for crawling and, more importantly, scoring. For instance, if you do this
search:
http://re.search.wikia.com/search#java
You will see that java.net and www.java.net are currently counted as
different URLs, even though they are duplicates, and have differing scores.
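
To make the java.net case concrete, here is a rough sketch of grouping URLs by a comparison key. This is not how Nutch does it. The helper names (host_key, group_duplicates) are made up for illustration, and treating "www.X" and "X" as the same host is an assumption that will not hold for every site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_key(url):
    """Build a comparison key that ignores case and a leading 'www.'.

    Assumption: 'www.example.org' and 'example.org' serve the same content.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # An empty path and "/" are treated as the same page.
    path = parts.path or "/"
    return host + path


def group_duplicates(urls):
    """Return only the keys that more than one input URL maps to."""
    groups = defaultdict(list)
    for url in urls:
        groups[host_key(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

With something like this, http://java.net/ and http://www.java.net both map to the key "java.net/", so they could be scored as one page instead of two.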
>
> Also, I don't know how they're currently done, but the order may matter.
> I'd do it in this order:
> -Remove #...
> -Remove session ids
> -Clean &s
> -Remove ?&var
> -Trailing ?
> -Change default pages into standard
>
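
A minimal sketch of that ordering as a list of regex rules applied in sequence. These are not our current expressions. Every pattern below, including the session id parameter names and the default page extensions, is an assumption chosen for illustration.

```python
import re

# Rules are applied top to bottom; the order follows the list above.
RULES = [
    # 1. Remove the fragment (#...).
    (re.compile(r"#.*$"), ""),
    # 2. Remove session ids passed as query parameters (names are assumed;
    #    ";jsessionid=" path parameters are not handled here).
    (re.compile(r"([?&])(?:phpsessid|jsessionid|sid)=[^&]*", re.I), r"\1"),
    # 3. Clean &s: collapse runs of & and drop a trailing &.
    (re.compile(r"&{2,}"), "&"),
    (re.compile(r"&$"), ""),
    # 4. Remove the & in "?&var" so it becomes "?var".
    (re.compile(r"\?&"), "?"),
    # 5. Remove a trailing ?.
    (re.compile(r"\?$"), ""),
    # 6. Change default pages into the standard "/" form, keeping any
    #    query string that follows the page name.
    (re.compile(r"/(?:index|default)\.(?:html?|php5?|aspx?|jsp)([^/]*)$"), r"/\1"),
]


def normalize(url):
    """Apply each rule once, in order, and return the rewritten URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url
```

The order does matter in this sketch: for http://www.site.com/index.html?sid=1#x the fragment has to go first so that the session id rule leaves a bare trailing ?, which rule 5 removes before rule 6 reduces the URL to http://www.site.com/.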
> Some suggestions:
> -Remove maxage and smaxage parameters for comparing.
I don't understand what these are. Are they just query parameters?
> -Add php5 to the extensions list (although if they're putting the
> version in the extension, it's probably NOT the default).
Will do.
> -The ending with ([^/]*)$ instead of $ doesn't make me feel too comfortable.
The ([^/]*) is there to make sure things like wiki/index.php/Main_Page
don't get changed to wiki//Main_Page.
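
Here is a small sketch of the difference. The two patterns are simplified stand-ins, not the exact expressions we run, and the extension list is an assumption.

```python
import re

# Anchored form: after the page name, only non-slash characters may
# follow up to the end of the URL, so a deeper path blocks the match.
ANCHORED = re.compile(r"/index\.(?:html?|php5?)([^/]*)$")

# Unanchored form: matches the page name anywhere in the URL.
UNANCHORED = re.compile(r"/index\.(?:html?|php5?)")


def strip_anchored(url):
    """Drop the default page name but keep what followed it (e.g. a query)."""
    return ANCHORED.sub(r"/\1", url)


def strip_unanchored(url):
    """Drop the default page name wherever it appears."""
    return UNANCHORED.sub("/", url)
```

For http://www.site.com/wiki/index.php/Main_Page the anchored version leaves the URL alone, while the unanchored one produces http://www.site.com/wiki//Main_Page. Both turn http://www.site.com/index.html into http://www.site.com/.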
> -Use ETags
I don't know whether Nutch supports ETags yet. If not, it is definitely
something that is needed. :)