<br><br><div><span class="gmail_quote">On 31/01/2008, <b class="gmail_sendername">Dennis Kubes</b> <<a href="mailto:kubes@apache.org">kubes@apache.org</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
We are currently working on URL normalization measures for the search<br>wikia crawls. URL normalization is used during crawls to change URLs<br>into standard forms. An example of this is to have <a href="http://www.site.com/index.html">www.site.com/index.html</a><br>
and <a href="http://www.site.com/">www.site.com/</a> resolve to the same URL for crawling and scoring purposes.<br><br>Eventually the idea would be to allow normalizations on a per domain<br>basis and allow the community to give detailed feedback per domain.<br>
All normalizations are currently applied on a global basis. Our current URL<br>normalizations are done through regex, so I have included the current<br>expressions as well. So far we have come up with the following<br>normalizations. Is there anything else we should include or change? What<br>
does everyone think?</blockquote><div><br>1. Session id elimination.<br> <br> Is this wise? If you eliminate the session id from a URL then the server is<br> likely to respond with a completely different page that represents the start<br>
of a new session. This may result in the user seeing something quite <br> different to what the search engine sees, which may be confusing. <br><br>2. Pages served for directory requests.<br><br> I have also seen "welcome.htm[l]" and "home.htm[l]" used in this context.<br>
 However, I think doing this sort of normalisation is unwise, and I agree with<br> Balinny that you need to access the actual pages. Unfortunately, a typically<br> configured Apache server will not return a "Content-Location" header;<br>
 it will simply and silently serve the contents of "index.html" when you request the directory.<br> [I think, but I'm not sure, that MS IIS does return a "Content-Location" header.]<br><br> So I'm afraid you'll have to fetch both "/" and "/index.html" and determine<br>
that they're the same by checksumming or content inspection.<br><br>3. Removal of fragments (the bit after the #)<br><br> Yes, of course, but remember, and the regex quoted doesn't, that this may<br>
 interact with the dynamic part of the URL (the bit after the "?"). Under<br> RFC3986 the query comes before the "#" and the fragment runs to the end of<br> the URL, so a "?" that appears after the "#" is part of the fragment and does<br> not start a dynamic part (or query).<br>
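<br> A minimal sketch of fragment removal that leaves the query untouched. It uses Python's standard urllib.parse rather than the regex under discussion, and the function name strip_fragment is only illustrative:<br>
<pre>
```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its fragment; the query string is kept."""
    # urldefrag splits at the first "#", so any "?" before it stays in
    # the query and any "?" after it is discarded with the fragment.
    return urldefrag(url).url


print(strip_fragment("http://www.site.com/page?chap=23#sec2"))
```
</pre>
 Anything after the first "#" is dropped, including a "?" that happens to appear there.<br>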
<br>4. Collapse multiple ampersands (&amp;) to a single ampersand.<br><br> Can't see why.<br><br>5. Removal of initial & after ?<br><br> I.e. ?&var=.... -> ?var=....<br><br> OK if you really want to, but personally I'd prefer to ensure that there is always<br>
 an ampersand in this position, as it makes the dynamic part slightly easier<br> to parse.<br><br>6. Remove trailing ?<br><br> If it's all on its own, this seems sensible, but check with the server; it just might<br>
do something insufferably clever.<br><br>Some extra suggestions<br><br>7. Encoded characters<br><br> Map + to space (RFC1630). Map hex-encoded unreserved characters to<br> their unencoded equivalents. [E.g. %7e -> ~, see RFC3986]<br>
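<br> As a rough illustration of the hex-decoding rule, the sketch below decodes only escapes that stand for RFC3986 unreserved characters and leaves everything else encoded. Uppercasing the remaining escapes is an extra step taken from RFC3986's normalization advice, not something proposed above, and the name decode_unreserved is hypothetical:<br>
<pre>
```python
import re

# Unreserved characters as listed in RFC 3986 section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(url: str) -> str:
    """Decode %XX escapes of unreserved characters; keep all others."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved or non-ASCII byte: keep the escape, hex digits uppercased.
        return "%" + match.group(1).upper()

    return PCT_ESCAPE.sub(replace, url)


print(decode_unreserved("http://www.site.com/%7euser/a%2fb"))
```
</pre>
 The escaped slash stays encoded because decoding a reserved character could change which resource the URL names.<br>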
<br>8. Leading double dots etc.,<br><br> Do something coherent with URLs that start .././ and the like. Again see RFC3986<br> for detailed discussion. This is associated with the process I call derelativisation,<br>
i.e. converting a relative URL to an absolute URL.<br><br>9. Care with case<br><br> If the server is Unix/Linux based then URL case must be preserved since the underlying<br> file naming system is case sensitive. On Microsoft based servers file naming is not<br>
case sensitive, so if server signature analysis (the Server header) suggests a <br> Microsoft based host then MYPAGE.HTM and mypage.htm can be regarded as<br> being the same.<br><br>10. Order of dynamic parts.<br>
<br> In general the order of the variable settings in the dynamic part of a URL is<br> unimportant. I.e. bigsite?&chap=23&page=11 and bigsite?&page=11&chap=23<br> will both refer to the same document. This requires parsing the dynamic part<br>
and comparing the sequence.<br><br><br>General point.<br><br> I think it would be better to retain all the URLs in the database and associate<br> an arbitrary document identification with them. So 2 (or more) URLs that redirect to<br>
or refer to the same document will retain their distinctiveness but will all be associated<br> with the same document. This mechanism can support both HTTP and meta tag<br> redirection.<br><br></div><br>
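 The general point could be sketched roughly as follows. This is only an illustration under stated assumptions: documents are matched by a SHA-256 checksum of the fetched body, identifiers are plain integers, and the UrlIndex class with its method names is hypothetical, not part of any existing crawler:<br>
<pre>
```python
import hashlib


class UrlIndex:
    """Keep every URL, but map URLs with the same content to one document id."""

    def __init__(self):
        self.doc_of_url = {}     # URL -> document id
        self.doc_of_digest = {}  # content checksum -> document id
        self._next_id = 0

    def add_page(self, url, body):
        """Record a fetched page; identical bodies share a document id."""
        digest = hashlib.sha256(body).hexdigest()
        doc_id = self.doc_of_digest.get(digest)
        if doc_id is None:
            doc_id = self._next_id
            self._next_id += 1
            self.doc_of_digest[digest] = doc_id
        self.doc_of_url[url] = doc_id
        return doc_id

    def add_redirect(self, source, target):
        """Record an HTTP or meta tag redirect to an already known URL."""
        doc_id = self.doc_of_url[target]
        self.doc_of_url[source] = doc_id
        return doc_id

    def urls_for(self, doc_id):
        """Return all URLs associated with a document, sorted."""
        return sorted(u for u, d in self.doc_of_url.items() if d == doc_id)


index = UrlIndex()
home = index.add_page("http://www.site.com/", b"home page")
index.add_page("http://www.site.com/index.html", b"home page")
index.add_redirect("http://site.com/", "http://www.site.com/")
print(index.urls_for(home))
```
</pre>
 All three URLs stay in the table, so per-URL information is not lost, while scoring can work on the shared document id.<br>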
</div><br>