[Search-l] URL Normalization and Input

Balinny balinny at gmail.com
Thu Jan 31 23:39:27 UTC 2008


Dennis Kubes wrote:
> We are currently working on URL normalization measures for the search 
> wikia crawls.  URL normalization is used during crawls to change URLs 
> into standard forms.  An example of this is having www.site.com/index.html 
> and www.site.com/ resolve to the same URL for crawling and scoring purposes.
>
> Eventually the idea would be to allow normalizations on a per domain 
> basis and allow the community to give detailed feedback per domain. 
> Currently all normalizations are on a global basis.  Our current URL 
> normalizations are done through regex so I have included the current 
> expressions as well.  Currently we have come up with the following 
> normalizations, is there anything else we should include, change?  What 
> does everyone think?
>   
I hope it's only done for URL comparison purposes (to avoid duplicate 
results), not for crawling. The remove-index rule is especially dangerous.
We should rely on the Content-Location header to detect that the index 
page is the same as /.
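
To make the Content-Location idea concrete, here is a rough sketch (not 
what the crawler does today, as far as I know). The function name 
content_location_alias and the plain dict of response headers are my own 
assumptions for illustration. It only resolves the header against the 
requested URL, so the crawler can record the two URLs as the same 
resource when the server says so, instead of guessing from the file name.

```python
from urllib.parse import urljoin


def content_location_alias(request_url, headers):
    """Return the absolute URL named by Content-Location, or None.

    `headers` is assumed to be a plain dict of response headers;
    header names are matched case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "content-location":
            # The header value may be relative, so resolve it
            # against the URL that was actually requested.
            return urljoin(request_url, value.strip())
    return None


alias = content_location_alias(
    "http://www.site.com/", {"Content-Location": "index.html"}
)
print(alias)
```

If the server sends no such header, the function returns None and the two 
URLs stay separate.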

Also, I don't know how they're currently done, but the order may matter. 
I'd do it in this order:
-Remove #...
-Remove session ids
-Clean &s
-Remove ?&var
-Trailing ?
-Change default pages into standard
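
A rough sketch of that order, to show why it matters. The regexes below 
are my own guesses, not the expressions from the original mail: the 
session parameter names (PHPSESSID, sid, jsessionid) and the default page 
names and extensions are assumptions, and I read "Remove ?&var" as 
turning "?&var" into "?var". The default-page rule is anchored with a 
plain $ so it only fires when nothing follows the file name.

```python
import re

# Each step is (pattern, replacement); they run in the listed order.
# All patterns are illustrative assumptions.
STEPS = [
    (re.compile(r"#.*$"), ""),                                # remove #...
    (re.compile(r";jsessionid=[^?#&]*", re.I), ""),           # session id in path
    (re.compile(r"([?&])(?:PHPSESSID|sid)=[^&]*", re.I), r"\1"),  # session id in query
    (re.compile(r"&{2,}"), "&"),                              # clean &s: collapse runs
    (re.compile(r"&$"), ""),                                  # clean &s: trailing &
    (re.compile(r"\?&"), "?"),                                # ?&var -> ?var
    (re.compile(r"\?$"), ""),                                 # trailing ?
    (re.compile(r"/(?:index|default)\.(?:html?|php|asp|jsp)$", re.I), "/"),  # default page
]


def normalize(url):
    """Apply every step once, in order, and return the result."""
    for pattern, replacement in STEPS:
        url = pattern.sub(replacement, url)
    return url


print(normalize("http://www.site.com/index.html?PHPSESSID=abc123&#top"))
print(normalize("http://www.site.com/a.php?sid=1&&x=2"))
```

In the first URL, removing the session id leaves "?&", cleaning the &s 
leaves a bare "?", and only after the trailing ? is gone does the URL end 
in index.html, so the default-page rule has to come last.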

Some suggestions:
-Remove maxage and smaxage parameters for comparing.
-Add php5 to the extensions list (although if they're putting the 
version in the extension, it's probably NOT the default).
-Ending the pattern with ([^/]*)$ instead of $ doesn't make me feel too 
comfortable.
-Use ETags
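
For the maxage/smaxage suggestion, something along these lines might 
work. It is only a sketch: the name comparison_key is made up, and it 
builds a key for comparing URLs, leaving the URL that actually gets 
fetched untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters ignored when comparing two URLs (from the suggestion above).
IGNORED_PARAMS = {"maxage", "smaxage"}


def comparison_key(url):
    """Return the URL with ignored parameters and the fragment dropped.

    Note: urlencode re-encodes the remaining values, so the key is
    meant for comparison only, not for fetching.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


a = comparison_key(
    "http://www.site.com/index.php?title=Foo&action=raw&maxage=3600&smaxage=0"
)
b = comparison_key("http://www.site.com/index.php?title=Foo&action=raw")
print(a == b)
```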



