[Search-l] URL Normalization and Input

Peter Burden peter.burden at gmail.com
Fri Feb 1 13:10:22 UTC 2008


On 31/01/2008, Dennis Kubes <kubes at apache.org> wrote:
>
> We are currently working on URL normalization measures for the search
> wikia crawls.  URL normalization is used during crawls to change URLs
> into standard forms.  An example of this is to have www.site.com/index.html
> and www.site.com/ resolve to the same URL for crawling and scoring
> purposes.
>
> Eventually the idea would be to allow normalizations on a per domain
> basis and allow the community to give detailed feedback per domain.
> Currently all normalizations are on a global basis.  Our current url
> normalizations are done through regex so I have included the current
> expressions as well.  Currently we have come up with the following
> normalizations, is there anything else we should include, change?  What
> does everyone think?


1.  Session id elimination.

     Is this wise? If you eliminate the session id from a URL then the
     server is likely to respond with a completely different page that
     represents the start of a new session. This may result in the user
     seeing something quite different to what the search engine sees,
     which may be confusing.

2.   Pages served for directory requests.

      I have also seen "welcome.htm[l]" and "home.htm[l]" used in this
      context. However, I think doing this sort of normalisation is unwise
      and agree with Balinny that you need to access the actual pages.
      However, a typically configured Apache server will not return a
      "Content-Location" header; it will simply and silently return
      "index.html" when you request the directory. [I think, but I'm not
      sure, that MS IIS does return a "Content-Location" header.]

      So I'm afraid you'll have to fetch both "/" and "/index.html" and
      determine that they're the same by checksumming or content inspection.
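
      A minimal Python sketch of the checksumming idea, assuming the two
      response bodies have already been fetched. The function names and the
      choice of SHA-256 are illustrative only and are not taken from the
      crawler under discussion.

```python
import hashlib


def content_checksum(body: bytes) -> str:
    """Return a hex digest identifying the exact bytes of a response body."""
    return hashlib.sha256(body).hexdigest()


def same_document(body_a: bytes, body_b: bytes) -> bool:
    """Treat two fetched pages as one document when their checksums match."""
    return content_checksum(body_a) == content_checksum(body_b)
```

      An exact checksum only matches byte-identical responses. Pages that
      embed a timestamp or similar would need the content inspection route
      instead.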

3.    Removal of fragments (the bit after the #)

       Yes, of course, but remember (and the regex quoted doesn't) that this
       may interact with the dynamic part of the URL (the bit after the "?").
       I would assume that the fragment part of the URL is terminated by the
       "?" if a dynamic part (or query) is present.
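
       One way to sketch fragment removal in Python is with the standard
       library's urldefrag. The wrapper name strip_fragment is my own. Note
       that urldefrag follows RFC3986, in which the query comes before the
       fragment, so a "?" that appears after the "#" is treated as part of
       the fragment and removed with it. A crawler that wants the behaviour
       assumed above would need its own splitting.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Drop the fragment (everything from the first '#') and keep the rest."""
    return urldefrag(url).url
```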

4.    Collapse multiple ampersands (&) to a single ampersand.

       Can't see why.

5.    Removal of initial & after ?

       I.e. ?&var=.... ->  ?var=....

       OK if you really want to. Personally I'd prefer to ensure that there
       is an ampersand in this position, as it makes the dynamic part
       slightly easier to parse.

6.    Remove trailing ?

       If it's all on its own, seems sensible - but check with server, it
       just might do something insufferably clever.
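
       For reference, the three query clean-ups in points 4 to 6 could be
       sketched in Python as below. These regexes are my own guesses at the
       behaviour described and are not the expressions from the original
       message. The sketch assumes any fragment has already been removed.

```python
import re


def tidy_query(url: str) -> str:
    """Apply the clean-ups from points 4 to 6 to a URL string."""
    url = re.sub(r"&{2,}", "&", url)  # 4: collapse runs of ampersands
    url = re.sub(r"\?&", "?", url)    # 5: drop an ampersand right after '?'
    url = re.sub(r"\?$", "", url)     # 6: drop a '?' left on its own at the end
    return url
```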

Some extra suggestions

7.    Encoded characters

       In the query part, map + to space (RFC1630). Map hex-encoded
       unreserved characters to their non-encoded equivalents.
       [E.g. %7e -> ~, see RFC3986.]
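
       A rough Python sketch of the second mapping, assuming "unreserved"
       means the RFC3986 set (letters, digits, "-", ".", "_" and "~"). The
       function name is hypothetical, and the + to space mapping is left
       out here.

```python
import re
import string

# Unreserved characters as listed in RFC 3986 section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PERCENT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(url: str) -> str:
    """Decode %XX escapes only when they stand for an unreserved character."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        # Escapes for reserved or non-ASCII bytes are left exactly as found.
        return char if char in UNRESERVED else match.group(0)

    return _PERCENT_TRIPLET.sub(replace, url)
```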

8.    Leading double dots etc.

       Do something coherent with URLs that start .././ and the like. Again
       see RFC3986 for detailed discussion. This is associated with the
       process I call derelativisation, i.e. converting a relative URL to an
       absolute URL.
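
       In Python, derelativisation can be sketched with the standard
       library's urljoin, which resolves dot segments against the URL of the
       page the link was found on. The wrapper name and the example URLs in
       the usage note are mine.

```python
from urllib.parse import urljoin


def derelativise(base_url: str, link: str) -> str:
    """Resolve a link found on the page at base_url into an absolute URL."""
    return urljoin(base_url, link)
```

       For example, with a base of http://www.site.com/a/b/page.html the
       link .././c.html resolves to http://www.site.com/a/c.html.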

9.    Care with case

       If the server is Unix/Linux based then URL case must be preserved,
       since the underlying file naming system is case sensitive. On
       Microsoft based servers file naming is not case sensitive, so if
       server signature analysis (the Server header) suggests a Microsoft
       based host then MYPAGE.HTM and mypage.htm can be regarded as being
       the same.
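
       A small Python sketch of that rule, assuming the Server header value
       is already to hand and that the substring "microsoft" (as in
       "Microsoft-IIS/6.0") is a good enough signal. Both the function name
       and that test are my own simplifications.

```python
from urllib.parse import urlsplit, urlunsplit


def fold_path_case(url: str, server_header: str) -> str:
    """Lower-case the path only when the Server header suggests a Microsoft host."""
    if "microsoft" not in server_header.lower():
        # Assume a case-sensitive file system and keep the URL untouched.
        return url
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path.lower()))
```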

10.   Order of dynamic parts.

       In general the order of the variable settings in the dynamic part of
       a URL is unimportant. I.e. bigsite?&chap=23&page=11 and
       bigsite?&page=11&chap=23 will both refer to the same document. This
       requires parsing the dynamic part and comparing the sequence.
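
       One possible way to do the comparison in Python is to parse the
       dynamic part into name/value pairs and sort them into a comparison
       key. The function name and the host in the usage note are
       hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def comparison_key(url: str) -> tuple:
    """Build a key that ignores the order of the query's name=value pairs."""
    parts = urlsplit(url)
    # parse_qsl skips empty segments, so a leading '&' does not matter here.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return (parts.scheme, parts.netloc, parts.path, tuple(sorted(pairs)))
```

       Two URLs such as http://www.site.com/bigsite?&chap=23&page=11 and
       http://www.site.com/bigsite?&page=11&chap=23 then produce equal keys.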


General point.

       I think it would be better to retain all the URLs in the database and
       associate an arbitrary document identification with them. So 2 (or
       more) URLs that redirect to or refer to the same document will retain
       their distinctiveness but will all be associated with the same
       document. This mechanism can support both HTTP and meta tag
       redirection.
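
       As a sketch of the idea only, an in-memory version of such a mapping
       might look like this in Python. The class and method names are
       invented for illustration; a real crawl database would look quite
       different.

```python
import itertools


class DocumentIndex:
    """Keep every URL and map each one to a shared, arbitrary document id."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._doc_id_by_url = {}

    def add_document(self, url: str) -> int:
        """Register a URL as a new document, or return its existing id."""
        if url not in self._doc_id_by_url:
            self._doc_id_by_url[url] = next(self._next_id)
        return self._doc_id_by_url[url]

    def add_alias(self, alias_url: str, target_url: str) -> int:
        """Record that alias_url redirects to, or duplicates, target_url."""
        doc_id = self.add_document(target_url)
        self._doc_id_by_url[alias_url] = doc_id
        return doc_id

    def urls_for(self, doc_id: int) -> list:
        """List every URL currently associated with a document id."""
        return sorted(u for u, d in self._doc_id_by_url.items() if d == doc_id)
```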