How Web Crawlers Work / myLot

India

December 19, 2006 3:56am CST

A web crawler (also known as a web spider or web robot) is a program or automated script that browses the internet looking for web pages to process. Many applications, mostly search engines, crawl websites every day in order to find up-to-date data. Most web crawlers save a copy of each visited page so they can index it later; the rest scan pages for one specific purpose only, such as harvesting email addresses (for spam).

How does it work?

A crawler needs a starting point, which is a web address (a URL). To browse the internet it uses the HTTP network protocol, which lets it talk to web servers and download data from them or upload data to them. The crawler fetches this URL and then looks for hyperlinks (the A tag in HTML). It then fetches those links and carries on the same way.

That is the basic idea. How we proceed from here depends entirely on the purpose of the software. If we only want to grab email addresses, we search the text of each web page (including hyperlinks) for email addresses. This is the easiest type of software to develop.

Search engines are much more difficult to develop. When building a search engine we need to take care of a few other things:

1. Size - Some web sites are very large and contain many directories and files. Harvesting all of the data may take a lot of time.
2. Change frequency - A web site may change very often, even a few times a day. Pages can be added and deleted every day. We need to decide when to revisit each site and each page within a site.
3. Processing the HTML output - If we build a search engine, we want to understand the text rather than just treat it as plain text. We must tell the difference between a caption and a simple sentence. We must look for bold or italic text, font colors, font sizes, paragraphs and tables. This means we must know HTML very well, and we need to parse it first.
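The basic crawl loop described above (start from one URL, download it over HTTP, pull out the A-tag links, then visit those the same way) could be sketched roughly like this in Python. This is only a minimal illustration using the standard library; the names LinkExtractor, fetch and crawl, the max_pages limit and the UTF-8 assumption are all my own choices, not part of any particular crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every A tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page over HTTP and return its text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl from start_url; returns {url: html}."""
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, so we never loop
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue            # skip pages that cannot be downloaded
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

The seen set is what stops the crawler from going round in circles when pages link back to each other, and max_pages is just a safety limit so the sketch stops on a large site. The fetch_page parameter is there so the loop can be tried out with a fake downloader before pointing it at a real web server.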
For the parsing step, what we need is a tool called an "HTML to XML converter". One can be found on my website: look in the resource box, or go to the Noviway website at www.Noviway.com.

That's it for now. I hope you learned something.
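To make point 3 a little more concrete, here is a rough Python sketch of telling headings and bold or italic text apart from ordinary sentences while parsing a page. It uses the standard library's HTMLParser instead of an HTML to XML converter, and the tag groups, the class name StructuredText and the three labels ("heading", "emphasis", "plain") are assumptions of mine for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical tag groups; how a real search engine weighs them is not covered here.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "caption"}
EMPHASIS_TAGS = {"b", "strong", "i", "em"}


class StructuredText(HTMLParser):
    """Labels each piece of text as heading, emphasis or plain."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # heading/emphasis tags currently open
        self.chunks = []      # list of (kind, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS or tag in EMPHASIS_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag; tolerate sloppy HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if any(t in HEADING_TAGS for t in self.open_tags):
            kind = "heading"
        elif self.open_tags:
            kind = "emphasis"
        else:
            kind = "plain"
        self.chunks.append((kind, text))


def classify(html):
    """Return the page text as a list of (kind, text) pairs."""
    parser = StructuredText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For example, classify("<h1>Web Crawlers</h1><p>A crawler <b>follows</b> links.</p>") labels "Web Crawlers" as a heading, "follows" as emphasis and the rest as plain text, which an indexer could then weigh differently.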
