For example, the temperature of each city on each day is available on the internet; however, you cannot analyze the temperature distribution or trends unless all of this data is captured in a single database.
In this case, a targeted crawler could automate the collection of data from sites such as weather.com and automatically save the temperature by date and city. Our intelligent crawler would learn the structure of the target website and automatically capture data for all of the cities. Our targeted crawlers are even capable of grabbing other parameters, such as region or country, and storing them in a properly structured hierarchy. Now, with a detailed historical temperature database, you can do all kinds of analysis!
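The scraping step can be illustrated with a minimal sketch. The HTML snippet, the CSS class names, and the page layout below are all hypothetical (real weather sites serve far more complex markup); the sketch only shows how structured (city, date, temperature) records can be pulled out of table rows using Python's standard-library parser.

```python
from html.parser import HTMLParser

# Hypothetical markup of the kind a weather site might serve; the class
# names ("city", "date", "temp") are illustrative assumptions.
SAMPLE = """
<table>
  <tr><td class="city">Boston</td><td class="date">2024-01-05</td><td class="temp">-3</td></tr>
  <tr><td class="city">Austin</td><td class="date">2024-01-05</td><td class="temp">12</td></tr>
</table>
"""

class TempScraper(HTMLParser):
    """Collects one record per table row, keyed by cell class."""
    def __init__(self):
        super().__init__()
        self.records = []      # list of dicts, one per <tr>
        self._row = {}
        self._field = None     # class of the <td> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td":
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        if self._field and data.strip():
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row:
            self.records.append(self._row)

scraper = TempScraper()
scraper.feed(SAMPLE)

# Storage keyed by (city, date), ready for trend analysis.
by_city_date = {(r["city"], r["date"]): int(r["temp"]) for r in scraper.records}
print(by_city_date[("Boston", "2024-01-05")])  # -3
```

In practice this parsing logic is what the crawler "learns" per site: the tag and class names change, but the row-to-record mapping stays the same shape.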
In addition to downloading information from the public domain, crawlers also work as general-purpose data transformers. We have developed crawlers that convert data among various database systems, Excel spreadsheets, XML feeds, and plain text files. These data conversions often involve schema transformations, too.
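One such conversion, CSV to XML with a schema transformation, can be sketched as follows. The source columns, the target element names, and the field mapping are all invented for illustration; the point is that renaming and restructuring happen in one declarative mapping.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical flat CSV export whose column names differ from the
# target XML schema; FIELD_MAP encodes the schema transformation.
CSV_DATA = "cust_name,cust_city,order_total\nAcme,Boston,199.50\nGlobex,Austin,42.00\n"
FIELD_MAP = {"cust_name": "name", "cust_city": "city", "order_total": "total"}

def csv_to_xml(text, field_map):
    """Convert CSV rows into <customer> elements, renaming fields."""
    root = ET.Element("customers")
    for row in csv.DictReader(io.StringIO(text)):
        rec = ET.SubElement(root, "customer")
        for src, dst in field_map.items():
            ET.SubElement(rec, dst).text = row[src]
    return root

root = csv_to_xml(CSV_DATA, FIELD_MAP)
print(ET.tostring(root, encoding="unicode"))
```

Database-to-database conversions follow the same pattern, with the field map driving both the column renames and any type coercions.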
Writing a crawler can be challenging; it is often regarded as a black art. Websites are generated by many different server-side technologies such as JSP, .NET, or PHP frameworks. In addition to having an intimate knowledge of the inner workings of each technology, crawler writers must also reverse-engineer the original author's intent.
A crawling system typically consists of the following components:
- Content Fetching
- Content Scraping
- Browser Action Emulation
- Site Traversal
- Storage Structuring
- Entity Reconciliation
When we deliver crawlers, we also give our customers the sustained ability to acquire and consume external information.