[Search-l] URL Normalization and Input
Balinny
balinny at gmail.com
Thu Jan 31 23:39:27 UTC 2008
Dennis Kubes wrote:
> We are currently working on URL normalization measures for the search
> wikia crawls. URL normalization is used during crawls to change URLs
> into standard forms. An example of this is having www.site.com/index.html
> and www.site.com/ resolve to the same URL for crawling and scoring purposes.
>
> Eventually the idea would be to allow normalizations on a per domain
> basis and allow the community to give detailed feedback per domain.
> Currently all normalizations are on a global basis. Our current URL
> normalizations are done through regex so I have included the current
> expressions as well. Currently we have come up with the following
> normalizations, is there anything else we should include, change? What
> does everyone think?
>
I hope it's only done for URL comparison purposes (to avoid duplicate
results), not for crawling. The remove-index rule is especially dangerous.
We should rely on the Content-Location header to detect that the index
page is the same as /.
Also, I don't know how they're currently applied, but the order may matter.
I'd do it in this order:
-Remove #...
-Remove session ids
-Clean &s
-Remove ?&var
-Trailing ?
-Change default pages into the standard form
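
A rough sketch of that ordering in Python, to show why the order matters.
The regexes here are my own guesses, not the ones from the original post:
the session id names (jsessionid, PHPSESSID, sid) and the default-page
names are assumptions, and "Clean &s" is read as collapsing repeated or
trailing ampersands.

```python
import re

# Ordered (pattern, replacement) pairs. All patterns are illustrative
# assumptions, not the crawler's actual expressions.
STEPS = [
    # 1. Remove the fragment (#...)
    (re.compile(r"#.*$"), ""),
    # 2. Remove session ids (parameter names are assumed examples)
    (re.compile(r"(?i);jsessionid=[^?&]*"), ""),
    (re.compile(r"(?i)([?&])(?:phpsessid|sid)=[^&]*"), r"\1"),
    # 3. Clean &s: collapse runs of & and drop a trailing &
    (re.compile(r"&{2,}"), "&"),
    (re.compile(r"&$"), ""),
    # 4. Remove ?&var, i.e. turn "?&var" into "?var"
    (re.compile(r"\?&"), "?"),
    # 5. Remove a trailing ?
    (re.compile(r"\?$"), ""),
    # 6. Change default pages into the standard form (only when no
    #    query string is left; the page names are assumed examples)
    (re.compile(r"(?i)/(?:index|default)\.(?:html?|php5?|aspx?|jsp)$"), "/"),
]


def normalize(url):
    """Apply every step once, in the listed order."""
    for pattern, replacement in STEPS:
        url = pattern.sub(replacement, url)
    return url
```

If the default-page step ran first, a URL such as
http://www.site.com/index.html?PHPSESSID=abc would keep its index.html,
because the $-anchored pattern cannot match until the session id and the
trailing ? are gone.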
Some suggestions:
-Remove maxage and smaxage parameters for comparing.
-Add php5 to the extensions list (although if they're putting the
version in the extension, it's probably NOT the default).
-Ending the pattern with ([^/]*)$ instead of $ makes me uncomfortable.
-Use ETags
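
To make the header-based suggestions concrete, here is a minimal sketch
in Python. It assumes the crawler already has each response's headers as
a plain dict keyed by "Content-Location" and "ETag"; the function names
are hypothetical. Matching ETags across two different URLs is only a
heuristic hint that the content is the same, not a guarantee.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Cache-control style query parameters to ignore when comparing URLs.
CACHE_PARAMS = {"maxage", "smaxage"}


def comparison_key(url):
    """Return the URL without maxage/smaxage, for comparison only."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in CACHE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def same_resource(url_a, headers_a, url_b, headers_b):
    """Guess whether two fetched URLs returned the same document."""
    # Resolve Content-Location against the request URL; with no header,
    # urljoin(url, "") simply returns the request URL.
    loc_a = urljoin(url_a, headers_a.get("Content-Location", ""))
    loc_b = urljoin(url_b, headers_b.get("Content-Location", ""))
    if loc_a == loc_b:
        return True
    # Fallback heuristic: same host and identical strong ETags.
    etag_a = headers_a.get("ETag")
    etag_b = headers_b.get("ETag")
    if not etag_a or etag_a != etag_b or etag_a.startswith("W/"):
        return False
    return urlsplit(url_a).netloc == urlsplit(url_b).netloc
```

With this, / and /index.html are treated as one page only when the server
itself says so (for example by answering / with
"Content-Location: index.html"), instead of the crawler assuming it.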