[Search-l] Not just searching
peter burden
peter.burden at gmail.com
Tue May 1 23:21:59 UTC 2007
Hello,
Glad to see this list has woken up.
I think there's more to a good search engine than indexing and ranking.
It is also important to crawl a good representative set of pages.
Writing a good crawler isn't as simple as it might seem.
Are we going to crawl dynamic pages, break frame sets, get Ajax content,
and handle PDF, PowerPoint, MS Word, etc.? What about detection of
duplicate pages and alias hosts? What about links buried in JavaScript?
What about those dynamically generated calendars of the month's events?
Before you can blink, the crawler has got to July 2344!
There are many other traps for the unwary crawler writer.
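
To make two of these traps concrete, here is a rough sketch of the sort
of guards a crawler might apply. It is only an illustration: the function
names, the two-year horizon and the rule that any four-digit query value
is a year are all my own assumptions, and a real crawler would need
something less crude (an id=9999 parameter would be flagged as a year,
for instance). The fingerprint only catches pages whose text is identical
after whitespace and case are normalised, not near-duplicates.

```python
import hashlib
import re
from datetime import date
from urllib.parse import parse_qs, urlsplit

# Assumed policy: ignore calendar pages more than this many years ahead.
MAX_YEARS_AHEAD = 2


def content_fingerprint(text: str) -> str:
    """Hash the page text after collapsing whitespace and folding case."""
    normalized = " ".join(text.split()).casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_far_future_calendar(url: str, today: date) -> bool:
    """Return True if any four-digit query value looks like a distant year."""
    query = parse_qs(urlsplit(url).query)
    for values in query.values():
        for value in values:
            if re.fullmatch(r"\d{4}", value):
                if int(value) > today.year + MAX_YEARS_AHEAD:
                    return True
    return False


def should_fetch(url: str, today: date) -> bool:
    """Decide before fetching: skip URLs that look like an endless calendar."""
    return not is_far_future_calendar(url, today)


def is_duplicate(text: str, seen_fingerprints: set) -> bool:
    """Decide after fetching: record the fingerprint, report repeats."""
    fingerprint = content_fingerprint(text)
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False
```

The crawler would call should_fetch() on each URL it extracts and
is_duplicate() on each page it downloads, keeping one shared set of
fingerprints for the whole crawl.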
Will the crawler, which has to parse pages to find links, also put pages
into a standard form for use by the indexer? What about content in
different languages and character sets? Will the searcher distinguish
Québec from Quebec?
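
On the Québec question, one possible approach is to index each term
under two keys: one that keeps accents and one that folds them away, so
the searcher can choose whether to distinguish the two spellings. The
sketch below uses Python's standard unicodedata module; the function
names are my own, and stripping combining marks is a simplification that
suits French but is not right for every language.

```python
import unicodedata


def exact_key(term: str) -> str:
    """Case-insensitive key that keeps accents (composed NFC form)."""
    return unicodedata.normalize("NFC", term).casefold()


def folded_key(term: str) -> str:
    """Case-insensitive key with accents removed."""
    decomposed = unicodedata.normalize("NFKD", term)
    # After decomposition, accents are separate combining characters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.casefold()
```

With these, folded_key("Québec") and folded_key("Quebec") are the same
string, while exact_key("Québec") and exact_key("Quebec") differ. The
NFC step also means that an e followed by a separate combining acute
accent is treated the same as the single precomposed é character.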
These may seem tactical issues compared with some of the strategic ideas
being mentioned, but crawling parameters need to be considered, or users
will be dissatisfied with poor coverage, dead links and duplication.