[Search-l] URL Normalization and Input
Peter Burden
peter.burden at gmail.com
Fri Feb 1 13:10:22 UTC 2008
On 31/01/2008, Dennis Kubes <kubes at apache.org> wrote:
>
> We are currently working on URL normalization measures for the search
> wikia crawls. URL normalization is used during crawls to change URLs
> into standard forms. An example of this is having www.site.com/index.html
> and www.site.com/ resolve to the same URL for crawling and scoring
> purposes.
>
> Eventually the idea would be to allow normalizations on a per domain
> basis and allow the community to give detailed feedback per domain.
> Currently all normalizations are on a global basis. Our current URL
> normalizations are done through regex so I have included the current
> expressions as well. Currently we have come up with the following
> normalizations, is there anything else we should include, change? What
> does everyone think?
1. Session id elimination.
Is this wise? If you eliminate the session id from a URL then the
server is likely to respond with a completely different page that
represents the start of a new session. This may result in the user
seeing something quite different to what the search engine sees which
may be confusing.
2. Pages served for directory requests.
I have also seen "welcome.htm[l]" and "home.htm[l]" used in this
context. However, I think doing this sort of normalisation is unwise
and agree with Balinny that you need to access the actual pages.
Unfortunately a typically configured Apache server will not return a
"Content-Location" header; it will simply and silently return the
contents of "index.html" when you request the directory. [I think, but
I'm not sure, that MS IIS does return a "Content-Location" header.]
So I'm afraid you'll have to fetch both "/" and "/index.html" and
determine that they're the same by checksumming or content inspection.
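The checksumming step could look something like the Python sketch
below. It assumes the crawler has already fetched both response bodies
as bytes; the function names and the choice of SHA-256 are mine, not
anything the crawl currently does.

```python
import hashlib

def content_digest(body: bytes) -> str:
    """Return a hex digest identifying the exact bytes of a response body."""
    return hashlib.sha256(body).hexdigest()

def same_document(body_a: bytes, body_b: bytes) -> bool:
    """True when two fetched bodies are byte-for-byte identical."""
    return content_digest(body_a) == content_digest(body_b)
```

An exact digest only catches byte-identical pages. Pages that embed a
timestamp or a session token will differ on every fetch, which is where
content inspection would have to take over.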
3. Removal of fragments (the bit after the #)
Yes, of course, but remember, and the regex quoted doesn't, that this
may interact with the dynamic part of the URL (the bit after the "?").
I would assume that the fragment part of the URL is terminated by the
"?" if a dynamic part (or query) is present.
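A minimal Python sketch of fragment removal follows. One caveat on the
assumption above: RFC 3986 orders the components as path?query#fragment,
so in a conforming URL the query comes first and everything after the
first "#" is fragment, including any "?". The sketch follows the RFC by
using urllib.parse.urldefrag; the function name strip_fragment is just
for illustration. How to treat a "?" that turns up after the "#" in a
malformed link is left as a policy decision.

```python
from urllib.parse import urldefrag

def strip_fragment(url: str) -> str:
    """Drop the fragment (everything from the first '#') and keep the query."""
    return urldefrag(url).url
```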
4. Collapse multiple ampersands (&) to a single ampersand.
Can't see why.
5. Removal of initial & after ?
I.e. ?&var=.... -> ?var=....
OK if you really want to. Personally I'd prefer to ensure that there
is always an ampersand in this position, as it makes it slightly
easier to parse the dynamic part should you need to.
6. Remove trailing ?
If it's all on its own, seems sensible - but check with server, it
just might do something insufferably clever.
Some extra suggestions
7. Encoded characters
Map + to space in the query part (RFC 1630). Map hex-encoded
unreserved characters to their unencoded equivalents [e.g. %7e -> ~,
see RFC 3986].
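As a rough Python sketch of the second mapping: decode a %XX escape
only when it stands for one of the RFC 3986 unreserved characters
(letters, digits, "-", ".", "_", "~") and leave every other escape
alone. The name decode_unreserved is made up for this example, and
upper-casing the hex digits of the escapes that remain is an extra
step taken from RFC 3986 section 6.2.2.1 rather than something
proposed above. The + to space mapping is deliberately left out.

```python
import re

# The "unreserved" set from RFC 3986 section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def decode_unreserved(url: str) -> str:
    """Decode %XX escapes of unreserved characters; keep all others encoded."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        # Still encoded: only normalise the hex digits to upper case.
        return "%" + match.group(1).upper()
    return _PCT.sub(repl, url)
```

Escapes for reserved characters such as %2F must stay encoded, since
decoding them would change how the path is split.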
8. Leading double dots etc.
Do something coherent with URLs that start .././ and the like. Again
see RFC 3986 for detailed discussion. This is associated with the
process I call derelativisation, i.e. converting a relative URL to an
absolute URL.
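In Python the standard library already does the RFC 3986 style
resolution, so a derelativisation step can be a thin wrapper around
urllib.parse.urljoin. The wrapper name and the example URLs are only
illustrative.

```python
from urllib.parse import urljoin

def derelativise(base_url: str, link: str) -> str:
    """Resolve a possibly relative link against the URL of the page it was found on."""
    return urljoin(base_url, link)
```

For example, with a base of http://www.site.com/a/b/page.html the link
".././x.html" resolves to http://www.site.com/a/x.html.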
9. Care with case
If the server is Unix/Linux based then URL case must be preserved
since the underlying file naming system is case sensitive. On
Microsoft based servers file naming is not case sensitive, so if
server signature analysis (the Server header) suggests a Microsoft
based host then MYPAGE.HTM and mypage.htm can be regarded as being
the same.
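A hedged Python sketch of that rule: the scheme and host are always
lower-cased (they are case-insensitive under RFC 3986), while the path
is lower-cased only when the Server header value contains "microsoft",
as in "Microsoft-IIS/6.0". The function name and the substring test are
assumptions for illustration; real signature analysis would need to be
more careful, since the header can be missing or misleading.

```python
from urllib.parse import urlsplit, urlunsplit

def normalise_case(url: str, server_header: str = "") -> str:
    """Lower-case scheme and host; lower-case the path only for Microsoft hosts."""
    parts = urlsplit(url)
    path = parts.path
    # Assumption: a Server header mentioning "microsoft" implies a
    # case-insensitive file system behind the server.
    if "microsoft" in server_header.lower():
        path = path.lower()
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, parts.fragment)
    )
```

The query string is left untouched here, since its case may matter to
the application regardless of the file system.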
10. Order of dynamic parts.
In general the order of the variable settings in the dynamic part of
a URL is unimportant. I.e. bigsite?&chap=23&page=11 and
bigsite?&page=11&chap=23 will both refer to the same document. This
requires parsing the dynamic part and comparing the sequence.
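One way to make the comparison cheap is to put the variable settings
into a canonical order, so that equivalent URLs become identical
strings. The Python sketch below sorts the "name=value" pairs
alphabetically; sort_query is a name invented here. As a side effect it
drops empty pairs, which also covers points 4 and 5, and an empty query
loses its "?" as in point 6. It assumes order never matters, which is
not safe for sites that repeat a variable name and depend on the
sequence.

```python
from urllib.parse import urlsplit, urlunsplit

def sort_query(url: str) -> str:
    """Return the URL with its query parameters in sorted order."""
    parts = urlsplit(url)
    # Split on '&', discard empty pieces left by leading or doubled ampersands.
    params = sorted(p for p in parts.query.split("&") if p)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "&".join(params), parts.fragment)
    )
```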
General point.
I think it would be better to retain all the URLs in the database and
associate an arbitrary document identifier with them. So 2 (or more)
URLs that redirect to or refer to the same document will retain their
distinctiveness but will all be associated with the same document.
This mechanism can support both HTTP and meta tag redirection.
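To make the idea concrete, here is a small in-memory Python sketch. The
class UrlIndex, its add method and the same_as parameter are all
hypothetical; a real crawl database would keep the same two mappings in
tables rather than dictionaries.

```python
from collections import defaultdict
from itertools import count

class UrlIndex:
    """Keep every URL, but map equivalent URLs to one document id."""

    def __init__(self):
        self._ids = count(1)              # source of arbitrary document ids
        self.doc_of = {}                  # url -> document id
        self.urls_of = defaultdict(set)   # document id -> all its urls

    def add(self, url, same_as=None):
        """Record url; share the id of same_as if that URL is already known."""
        if url in self.doc_of:
            return self.doc_of[url]
        if same_as is not None and same_as in self.doc_of:
            doc_id = self.doc_of[same_as]
        else:
            doc_id = next(self._ids)
        self.doc_of[url] = doc_id
        self.urls_of[doc_id].add(url)
        return doc_id
```

A redirect from A to B, whether by HTTP status or by meta tag, would be
recorded as add(A, same_as=B) once B is in the index, so both URLs
survive and scoring can be done per document.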