[Search-l] Not just searching

peter burden peter.burden at gmail.com
Tue May 1 23:21:59 UTC 2007


Hello,
    Glad to see this list has woken up.
    I think there's more to a good search engine than indexing and
    ranking. It is also important to crawl a good, representative set
    of pages. Writing a good crawler isn't as simple as it might seem.

    Are we going to crawl dynamic pages, break frame sets, get Ajax
    content, and handle PDF, PowerPoint, MS Word, etc.? What about
    detection of duplicate pages and alias hosts? What about links
    buried in JavaScript? What about those dynamically generated
    calendars of the month's events? Before you can blink, the crawler
    has got to July 2344!
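
    To make a few of these traps concrete, here is a rough Python
    sketch of the kind of guard a crawler might apply before fetching
    and after parsing. Everything in it is an assumption for
    illustration, not a description of any existing crawler: the names
    CrawlGuard, canonical_url and content_fingerprint are made up, and
    the depth cap of 8 link hops is arbitrary. The depth cap is what
    stops the endless "next month" calendar chain, the canonical URL
    catches trivially different spellings of one address, and the
    content hash catches exact duplicates, including a page served
    under two alias hosts.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

MAX_DEPTH = 8  # assumed cap on link hops from a seed page
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url):
    """Lowercase scheme and host, drop the fragment and any default port."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


def content_fingerprint(text):
    """Hash lowercased, whitespace-normalized text to spot exact duplicates."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CrawlGuard:
    """Hypothetical helper: decides what to fetch and flags duplicates."""

    def __init__(self, max_depth=MAX_DEPTH):
        self.max_depth = max_depth
        self.seen_urls = set()
        self.seen_content = {}  # fingerprint -> first URL seen with it

    def should_fetch(self, url, depth):
        """Refuse URLs that are too deep or already seen."""
        if depth > self.max_depth:
            return False  # e.g. the 9th "next month" link in a calendar
        key = canonical_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def duplicate_of(self, url, text):
        """Return the first URL seen with the same content, or None if new."""
        fingerprint = content_fingerprint(text)
        first = self.seen_content.get(fingerprint)
        if first is None:
            self.seen_content[fingerprint] = canonical_url(url)
        return first
```

    This only catches exact duplicates after trivial normalization.
    Near-duplicates (same article, different banner) and links built
    by JavaScript need heavier machinery and are not attempted here.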

    There are many other traps for the unwary crawler writer.

    Will the crawler, which has to parse pages to find links, also put
    pages into a standard form for use by the indexer? What about
    content in different languages and character sets? Will the
    searcher distinguish Québec from Quebec?
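
    The Québec question comes down to a choice the indexer and the
    searcher both have to make the same way. As a minimal sketch,
    assuming Python and its standard unicodedata module (the function
    names fold_accents and matches are hypothetical), accent-insensitive
    matching can be done by decomposing characters and dropping the
    combining marks, while accent-sensitive matching only normalizes
    the two strings to a common form:

```python
import unicodedata


def fold_accents(text):
    """Decompose (NFKD), drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query, term, accent_sensitive=False):
    """Compare two terms, with or without regard to accents."""
    if accent_sensitive:
        # NFC makes "e" + combining acute equal to precomposed "é"
        left = unicodedata.normalize("NFC", query).casefold()
        right = unicodedata.normalize("NFC", term).casefold()
        return left == right
    return fold_accents(query) == fold_accents(term)


print(matches("Québec", "Quebec"))                         # True
print(matches("Québec", "Quebec", accent_sensitive=True))  # False
```

    Folding accents is a policy decision, not a fix: it helps a user
    who types "Quebec" on an English keyboard, but in some languages
    the accented and unaccented forms are different words.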

    These may seem tactical issues compared with some of the strategic
    ideas being mentioned, but crawling parameters need to be
    considered, or users will be dissatisfied with poor coverage, dead
    links and duplication.



