General information about using the application. We recommend reading these tips and general notes: they can be useful for your projects and for understanding the application.

1. System software requirements

  • OS: Microsoft Windows 7 or later
  • .NET Framework 4.5.2 installed
  • ~300 MB of free disk space
  • ~300 MB of RAM or more
  • Stable internet connection
  • Firewall: open ports 80 and 443

2. System hardware requirements

  • CPU with 4 or more cores
  • More than 1 GB of free RAM
  • Free disk space for storing grabbed data

3. Application usage

The application is designed to extract data from websites and export this data to a structured format. You can export the data to .xlsx, .csv, .json, or .sql. You can collect any data from websites, such as product cards, counterparty cards, ad data, competitor information and much more. The application collects data in a multi-threaded style, either directly from your IP address or through a set of proxy servers.

First of all, create a new Data Excavator Task and complete its settings. You can also rely on the default settings. Note that many settings have a help button. Once you have created your project, simply click the Start button to crawl and capture the target site's data.

4. Licensing terms

Application licensing is built around license keys. You can get a license key on the website and activate it in the downloaded application. A demo key is also available.

Note that license keys are bound to calendar dates. A key is activated at the moment of purchase. Thus, the license validity period is counted from the moment of purchase, not from the moment of the key's activation in the application.

When the license expires, the application's primary actions – crawling and grabbing – are blocked.

Purchased license keys are sent to customers via e-mail. If you have any problems with your keys, please contact us.

Please note that the application may contact the website to verify the authenticity of your key. Also, each license key has a limit of 3 activations. If for any reason you have reached the limit but believe that the restriction was unfairly imposed – please contact us to correct the problem.

5. Application architecture

The application is written in a pure multithreaded style. The top-level structural unit is the Data Excavator Task, which encapsulates access to all other modules of the system. Each task isolates all the threads around its crawling and grabbing processes: each task includes its own crawler server and its own grabber server.

Let’s list all the principal modules:

  • Data Excavator Task - the base module that holds all other components
  • Crawling server - crawls the website and fetches links from crawled pages
  • Scraping (grabbing) server - extracts data from crawled pages
  • Logging server - logs system events and flushes logs to disk
  • A set of small threading servers, such as the links server, the IO server and others.

The crawling server takes links from the common queue, receives the source code of the pages, and adds the results to the waiting queue. The grabbing server sequentially takes pages from the waiting queue, matches each page against the data patterns, and parses it according to that page's set of selectors. The extracted data is written to special files or sent via an HTTP request to a specified URL.
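As an illustration of the crawler's link-fetching step, here is a minimal stdlib sketch: it takes a downloaded page's source, collects the links, and resolves them to absolute URLs ready for the crawl queue. The class and function names are illustrative, not the application's real API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags in a page's source code."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL,
                    # as a crawling server would before queueing them.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(page_html, page_url):
    collector = LinkCollector(page_url)
    collector.feed(page_html)
    return collector.links

html = '<a href="/catalog">Catalog</a> <a href="https://other.example/x">X</a>'
queue = extract_links(html, "https://shop.example/home")
# queue now holds the absolute URLs found on the page
```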

6. Import and export settings

You can easily import and export project settings using the corresponding buttons in the settings section. Please note that the site provides a library of ready-made settings for popular sites. You can also use the default settings as a starting point for a new project. We use the .json format to import and export data.
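Because settings are exported as .json, an exported settings file can be produced and re-read with any JSON tooling. The field names below are hypothetical, chosen only to illustrate the export/import round trip; the real file's structure is defined by the application.

```python
import json

# Hypothetical project settings fragment; the real exported file's
# field names may differ.
settings = {
    "ProjectName": "example-shop",
    "CrawlingType": "NET_NATIVE",
    "ThreadsCount": 8,
}

exported = json.dumps(settings, indent=2)   # "Export settings"
restored = json.loads(exported)             # "Import settings"
```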

Project settings library: https://data-excavator.com/settings-lib/

7. Main ways to crawl websites

There are two principal ways to crawl and extract data – .NET Native crawling and CEF Crawling. Let's look at their differences:

  • .NET (Native) crawling uses the built-in .NET classes, such as HttpWebRequest. This is a quick way to scan web pages, as it works with the source code that the web server returns at request time. Accordingly, this method is limited to sites whose content is static and fully present when the page is requested. It is NOT suitable for sites whose content is rendered dynamically via AJAX or JavaScript.
  • CEF Crawling uses the Chromium Embedded Framework, based on the Chromium web browser. This way of scanning effectively emulates working with the site through a web browser. It allows you to scan web pages with dynamic content and to interact with scanned pages by means of CEF Behaviors (user scripts or special conditions). However, this method is slower than native scanning because it loads a large number of libraries and waits for each page to be fully loaded and rendered.

We recommend using .NET (native) crawling as the primary method of scanning web pages. It is faster, more flexible, and suitable for most websites.

8. Main ways to grab data (data selectors)

Our application extracts data from site pages with the help of selectors. A selector is a set of rules that points the program to a certain HTML container on a web page. Accordingly, by defining a set of selectors for a project, you get a ready-made template with which the program will extract data from pages. Without selectors you cannot extract data from pages, so it is worth understanding how they work.

There are two principal ways to define which data to extract from a site. The first is .css selectors; the second is XPath expressions. We recommend starting with .css selectors – they are a simple and clear way to get acquainted with the program's algorithms and to understand the principle of data extraction. You can read more about css selectors in the corresponding article.

As an example of using selectors, consider the following situation. Suppose there is an element with id="myelement" on the page. To point the program at this element and make it take the element's contents, you can use the .css selector #myelement. The program will do the rest by itself – it will request the page, download its contents, pull out the element's value and save the result to the corresponding file.
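To make the #myelement example concrete, the sketch below mimics what that selector resolves to: find the element whose id attribute is "myelement" and take its text. This is a stdlib illustration of the idea, not the application's actual selector engine.

```python
from html.parser import HTMLParser

class IdTextExtractor(HTMLParser):
    """Collects the text inside the element with the given id attribute."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # >0 while inside the matched element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)

page = '<div><span id="myelement">Grabbed value</span></div>'
parser = IdTextExtractor("myelement")   # plays the role of "#myelement"
parser.feed(page)
result = "".join(parser.text)
```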

9. Extracted data exporting

You can export the collected data in different formats. There are several formats to choose from – .xlsx, .csv, .sql, .json. The most convenient and complete is the .json format – it is close in its structure to the internal file format, and allows you to present the information in the simplest form.

When exporting data, you get a file (or set of files) with text data, as well as a folder with binary data – images, archives and other files that were configured for downloading and collected by the system.

When exporting, you can choose whether to export the text of the collected elements or their HTML code together with the content. Accordingly, when selecting text you will receive only text, and when selecting HTML you will receive the outer HTML of the found elements.

To export data, use the corresponding menu item in the project block.

After that, select the necessary data to export and click “Export checked results”.

Speaking about the export process, two fundamental features of the project need explaining: the excavator core is built on the principle of historicity of the collected data, and on the ability to collect data from multidimensional arrays of elements.

By historicity we mean the life cycle of your data collection projects: today you can collect data from a project with one set of rules (template), and tomorrow with another. Today you can collect the conditional field “description”, and tomorrow stop collecting it and collect a different field – “short description” – instead. Accordingly, today's result set and tomorrow's may differ not only in structure, but also in field names, template names, and other parameters.

The Excavator IO core preserves the history of every template's settings. Each time you change a project's settings (i.e. change something in the Grabbing Patterns section), a copy of the old template is saved on disk and remains linked to the data already collected with it. This lets you change template settings without worrying about exporting the data.

This matters because templates are actively involved in export: in the results file you see the required fields from the templates matched against the results obtained from the site. Accordingly, during export (depending on the export file type and the number of template changes), the system uses the old templates to output the data that was collected under them. For .csv exports, a separate .csv file is created for each template that collected any data. For .xlsx exports, a separate sheet is created for each such template. For .sql exports, each template gets a set of columns matching the fields inside the template.

Thus, regardless of how many times templates changed, the export preserves all the information you have collected, without distortion.

The ability to collect data from multidimensional arrays is built into the system because not all data on sites is presented linearly. By linearity we mean the conventional principle of “one entity – one page”. Linear pages include a “product page”, an “organization card”, a “person's page on a social network”, and so on. Multidimensional arrays are pages with lists of data – for example, a “list of offers for sale”, a “list of service offices”, a “list of an organization's employees”, a “list of exchange quotes”, and so on. Essentially, a multidimensional data array is a set of entities that do not have separate pages of their own. Accordingly, our system uses so-called “external selectors” to work with such pages. An external selector defines the rules by which the system distinguishes one entity card from another, and lets you obtain the full set of cards along with the information about each entity. In terms of exporting, the Excavator kernel supports multidimensional arrays: with linear pages the export follows the principle “one page – one record”; with multidimensional pages it can follow “one page – several records”.

In any case, we believe that the easiest export format from the point of view of all the features of the system is .json. It allows you to take into account all the possible situations around the collected data, and to present a set of results in the most understandable and objective way. In other cases, the exported results may differ from each other in the number of files, field names and column structure.

10. Projects settings

The basic unit of the system is the project. Each project has several property groups. The main property groups are the crawler settings, the grabber settings, and the pattern list. The crawler settings affect the server that downloads page source code, works with the site over http(s), and downloads binary files. The grabber settings affect the server that extracts data from the downloaded pages. The list of patterns is the set of rules by which the grabber extracts the contents of pages.

Note – in the settings window there is a button “Set to defaults”, which resets all properties to a typical state.

Also, the settings window contains the buttons “Import settings” and “Export settings”. These buttons save all settings to a file or load all settings from a file.

11. Crawler: using proxy servers

In some cases, you may need a set of proxy servers to collect data from the site correctly. Sometimes the site is not available in your location, and sometimes the site actively monitors user behavior and blocks user activity by IP address. In such situations, a proxy server can be used to perform actions on your behalf. Our application supports proxy servers both for native data scanning and for CEF scanning.

You can define several proxy servers to be used for scanning the site. A distinctive feature of our application is the ability to rotate proxy servers during the scanning process. You can choose between several rotation modes – random rotation, consecutive rotation, or no rotation. Remember that our application is multithreaded – pages are scanned in parallel by several threads. Interaction with the proxy pool is synchronized: for example, with consecutive rotation, proxies are changed strictly in sequence even under multithreaded access from the thread pool. Smart handling of proxy servers is one of the special features of our application.
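The consecutive-rotation behavior described above can be sketched as a small thread-safe rotator: a lock serializes access so proxies are handed out in strict sequence even when many scanning threads ask at once. The class name and addresses are illustrative, not the application's internal API.

```python
import itertools
import threading

class ConsecutiveProxyRotator:
    """Hands out proxies in strict sequence even under multithreaded access."""
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next_proxy(self):
        # Serialize access so rotation stays strictly consecutive
        # even when called from a thread pool.
        with self._lock:
            return next(self._cycle)

pool = ConsecutiveProxyRotator(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
picks = [pool.next_proxy() for _ in range(4)]  # wraps around after the last proxy
```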

12. Crawler: analysing robots.txt, U-a chaining

Our application can analyze the contents of the robots.txt file. You can set whether the rules inside robots.txt should be respected. If the setting is enabled, the robots.txt file is downloaded and analyzed, and the application re-downloads and re-analyzes it every N days, in accordance with the settings.

Note that our application uses the “Robots.txt U-a chain” setting when analyzing the robots.txt file. This parameter has a default value of “Googlebot,Yandex,*,YandexBot”. This means that the file is analyzed under each User-Agent value in the list, in order, and the first block found is used as the set of effective rules. Thus, if the file contains a Googlebot block, only the rules from that block are applied. If there is a Yandex block but no Googlebot block, the Yandex block's rules are applied.
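The chain lookup described above can be sketched as follows: parse robots.txt into per-User-agent blocks, then take the first chain entry that has a block. The parsing here is deliberately simplified (one User-agent line per block, Disallow rules only); the application's real parser handles the full format.

```python
def parse_robots(text):
    """Split robots.txt into {user_agent: [disallow rules]} (simplified)."""
    blocks, agent = {}, None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or ":" not in line:
            continue
        field, value = [p.strip() for p in line.split(":", 1)]
        if field.lower() == "user-agent":
            agent = value
            blocks.setdefault(agent, [])
        elif field.lower() == "disallow" and agent is not None:
            blocks[agent].append(value)
    return blocks

def rules_for_chain(blocks, chain="Googlebot,Yandex,*,YandexBot"):
    """Return the first chain entry that has a block, and its rules."""
    for agent in chain.split(","):
        if agent in blocks:
            return agent, blocks[agent]   # first match wins
    return None, []

robots = """User-agent: Yandex
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""
agent, rules = rules_for_chain(parse_robots(robots))
# Googlebot has no block, so the Yandex block is chosen
```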

13. Crawler: CEF Behaviors

CEF Behaviors is a set of handlers designed for use in the process of downloading data from web pages. Very often we encounter situations when a page has some features that do not allow you to download its content linearly – these pages include sites with dynamic content, sites with authorization, sites with endless scrolling, sites with the button “Show more” and many others. In other words, “Page with features” is a page whose content requires interaction and response from the program. For correct downloading of content of such pages we use CEF Behaviors sets.

A CEF Behavior is an assembly of logic applicable to a certain page of the site. When you use CEF scanning, you work through the Chromium environment and can interact with the site's pages in different ways. Accordingly, the CEF thread matches each next page against PageUrlSubstringPattern, applies the logic inside the behavior, and then analyzes the results or raises the “page source code received and processed” event.

Let’s consider the main CEF Behavior fields:

  • WaitAfterPageLoaded_InSeconds_Step1 - how many seconds the thread waits after the initial page load.
  • JSScriptToExecute_Step2 - JS code executed after Step1.
  • WaitAfterpageLoaded_InSeconds_Step3 - how many seconds the thread waits after Step2 has executed.
  • LeavePageRule - the condition for finishing page processing, checked after Step3.
  • JSScriptToExecuteAfterPageHTMLCodeGrabbed - JS script executed after the LeavePageRule check if the rule is set to "LeavePageAfterJSEventReturnsSomeResult".
  • LeavePageRuleValue - the value at which page processing finishes, checked according to the LeavePageRule rule.

Let’s consider the possible values of CEFCrawlingPageLeaveEventType (checked via LeavePageRule):

  • 0 = LeavePageAfterIndexing - the thread finishes the page without checking LeavePageRule.
  • 1 = LeavePageAfterSomeTimeSpentInSeconds - the thread finishes the page after N seconds have passed. Several iterations may be needed to cover the entered time interval.
  • 2 = LeavePageAfterJSEventReturnsSomeResult - the thread finishes the page after the JSScriptToExecuteAfterPageHTMLCodeGrabbed user script returns the value set in LeavePageRuleValue.
  • 3 = LeavePageAfterNLinksParsed - the thread finishes the page after N links have been found on it.
  • 4 = LeavePageAfterNoNewLinksParsed - the thread finishes the page once no new links are found.

The overall process of browsing through the CEF thread and CEF behavior will look like this:

  1. Navigate Chromium to the next URL
  2. Look up a CEF behavior by URL pattern
  3. [If a behavior was found:] Wait for the number of seconds defined in WaitAfterPageLoaded_InSeconds_Step1
  4. [If a behavior was found:] Execute the JS script if one is defined in JSScriptToExecute_Step2
  5. [If a behavior was found:] Wait for the number of seconds defined in WaitAfterpageLoaded_InSeconds_Step3
  6. Get the page's HTML source code
  7. [If a behavior was found:] Check whether page processing can finish (by LeavePageRule and LeavePageRuleValue)
  8. [If a behavior was found:] [If LeavePageRule=LeavePageAfterJSEventReturnsSomeResult:] Execute the JS script from JSScriptToExecuteAfterPageHTMLCodeGrabbed and check whether the result matches LeavePageRuleValue.
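The steps above can be condensed into a sketch where the Chromium calls are replaced by plain callables. The field names follow the behavior settings listed earlier; the function itself is an illustration of the control flow, not the application's real code.

```python
import time

def run_cef_behavior(behavior, run_js, get_html):
    """behavior: dict of CEF Behavior fields (or None if no behavior matched);
    run_js / get_html stand in for the Chromium interaction."""
    if behavior:  # steps 3-5 only apply when a behavior was found
        time.sleep(behavior.get("WaitAfterPageLoaded_InSeconds_Step1", 0))
        if behavior.get("JSScriptToExecute_Step2"):
            run_js(behavior["JSScriptToExecute_Step2"])
        time.sleep(behavior.get("WaitAfterpageLoaded_InSeconds_Step3", 0))
    html = get_html()  # step 6
    can_leave = True
    if behavior and behavior.get("LeavePageRule") == "LeavePageAfterJSEventReturnsSomeResult":
        # steps 7-8: run the final script and compare with LeavePageRuleValue
        result = run_js(behavior["JSScriptToExecuteAfterPageHTMLCodeGrabbed"])
        can_leave = (result == behavior["LeavePageRuleValue"])
    return html, can_leave

executed = []
def fake_run_js(script):
    executed.append(script)
    return "complete" if script == "document.readyState" else None

behavior = {
    "WaitAfterPageLoaded_InSeconds_Step1": 0,
    "JSScriptToExecute_Step2": "window.scrollTo(0, document.body.scrollHeight);",
    "WaitAfterpageLoaded_InSeconds_Step3": 0,
    "LeavePageRule": "LeavePageAfterJSEventReturnsSomeResult",
    "JSScriptToExecuteAfterPageHTMLCodeGrabbed": "document.readyState",
    "LeavePageRuleValue": "complete",
}
html, can_leave = run_cef_behavior(behavior, fake_run_js, lambda: "<html>loaded</html>")
```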

14. Crawler - potential problems

The page scanning server has a certain set of potential problems that can arise while using the program. These problems stem from various extreme situations or combinations of factors that occur quite rarely. Nevertheless, it is worth understanding these factors and trying to avoid them. Let's consider the typical problems of a scanning server.

  • Scanning any site's pages can hang or run very slowly with faulty proxy servers. Both native scanning and CEF scanning wait for a response from the proxy server. If the proxy runs very slowly or incorrectly, the scanning threads will run very slowly as well; if the proxy goes offline at some point, your threads will also stop scanning the target site.
  • Sites may block your IP address during intensive scanning. Some sites actively monitor visitor behavior, tracking your activity and the number of requests within a specified period. If you download too much content from a website without using proxy servers, sooner or later the IP address of your computer or server will be blocked and scanning will stop.
  • Using incorrect User-agent headers. Many sites analyze User-agent headers and adjust the appearance of their pages to the target browsers. If you use a non-standard or unusual User-agent, some sites will not return correct content or will block you in accordance with their security policies.
  • Scanning of iframe content inside the destination page. At the moment you can only scan iframe content by accessing the iframe directly. The program does not yet support inline scanning of iframes embedded in the destination page; we plan to support this in future versions.
  • Scanning through Chromium can potentially hang in some situations. In our project, Chromium works through the CEFSharp layer, which uses the CEF layer, which accesses the Chromium core. In various situations, CEF threads can potentially hang without an obvious reason. We are working on eliminating these rare (but real) hangs; it is quite possible that you will never encounter them.

15. How to make a grabbing pattern

A grabbing pattern is a set of rules for extracting data from downloaded pages. In essence, a grabbing pattern is a few general settings plus an array of named selectors.

The general settings include the name of the pattern, a list of URL strings (to define the pages on which it can be used), and information about the external selector (if necessary).  

Inside the “Parsing elements” list, you need to list all the selectors you want to use to extract data from some page.  For example, if you want to extract “Price”, “Name” and “Description” parameters from some page of the product, then in the “Parsing elements” block you should define 3 corresponding elements – “Price”, “Name” and “Description”, in each of which you should specify the selector, by which the data will be extracted.
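Such a pattern can be pictured as plain data: a name, the URL substrings that scope it, and the “Parsing elements” list of named selectors. The field names and selector strings below are hypothetical, chosen to mirror the description above rather than the application's actual settings format.

```python
# Hypothetical grabbing pattern for a product page.
product_pattern = {
    "PatternName": "Product card",
    "PageUrlSubstringPatterns": ["/product/"],
    "ParsingElements": [
        {"Name": "Price",       "Selector": ".price"},
        {"Name": "Name",        "Selector": "h1.title"},
        {"Name": "Description", "Selector": "#description"},
    ],
}

def pattern_applies(pattern, url):
    """A pattern is used only on pages whose URL contains one of its substrings."""
    return any(s in url for s in pattern["PageUrlSubstringPatterns"])
```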

When you enter the “Grabber patterns” section, you will see a window for creating a list of patterns, not for entering a single pattern. Why does it work like this?  Because our application allows you to create several patterns at once for data extraction and use them in parallel while working. For example, you want to extract several types of data from the site at once – there is an “Organization card” and “Employee card of an organization” on the site that have different design and appearance. In this case, in the “Grabbing patterns” window you should define two patterns – one for “Organization Card” and the other for “Employee Card of an organization”. With this setting, the application downloads from the site and correctly extracts data from both entities. In general, if you want to extract several different entities from a certain site, we recommend you to use several patterns instead of several projects – in such a situation you download the content of the sites once and analyze it in several data schemes. Otherwise (if you decide to use multiple projects), you will have to download the content several times.

Once you have defined the general pattern rules, you will need to make a list of selectors that will indicate which data should be taken from the downloaded page. The selector is a pointer that tells the program where to extract some data from the page. We recommend using .css-selectors – this is an easy and convenient technology. You can learn more about making .css-selectors in the corresponding article.

When extracting data from a site, you may need to extract images or binary files as well. In this case, configure the “Parsing attrs list” section – it is responsible for downloading binary files referenced by arbitrary attributes of DOM elements. As a rule, binary data is embedded into an HTML page via attributes (src for images, data-* for various media files). Enter the attributes you want downloaded so that the application performs the appropriate actions.
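The “Parsing attrs list” idea can be sketched as follows: scan the page for the attributes you listed and queue their values for download. The attribute names and markup here are hypothetical examples.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collects values of the listed attributes from every tag on the page."""
    def __init__(self, wanted_attrs):
        super().__init__()
        self.wanted = set(wanted_attrs)
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.wanted and value:
                # In the application these values would be queued
                # for binary download.
                self.found.append((name, value))

page = '<img src="/img/a.png"><a data-file="/files/b.zip">B</a>'
collector = AttrCollector(["src", "data-file"])   # the "attrs list"
collector.feed(page)
downloads = collector.found
```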

16. Outer selectors - what they are

External selectors are a tool for working with multidimensional arrays of information on web pages. Using external selectors allows you to organize the set of information received from the sites and simplify further work with this information.

According to our conventional classification, web pages can be divided into two categories – one-dimensional and multidimensional. One-dimensional pages contain a single entity: a person's business card, an organization's contact information, a product page and the like. Multidimensional pages, on the contrary, contain a lot of information of the same type grouped into blocks – for example, a history of records, all organizations in a certain region, or any list of data.

“External selectors” were invented precisely for working with multidimensional arrays. Using an external selector, you tell the system that you are working with a multidimensional page and want, as a result, a multidimensional array with information about each record on the page. An external selector is a .css or XPath expression that points to an HTML node in the tree that is external to the data being collected.
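As an illustration, suppose the external selector is the class "offer" and each element carrying it is one entity card: the page then yields one record per card. The class name and markup are hypothetical, and this stdlib sketch only mimics the splitting step, not the application's real engine.

```python
from html.parser import HTMLParser

class OuterSelectorSplitter(HTMLParser):
    """Collects the text of every element carrying the outer selector's class."""
    def __init__(self, outer_class):
        super().__init__()
        self.outer_class = outer_class
        self.depth = 0        # >0 while inside the current entity card
        self.records = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth:
            self.depth += 1
        elif self.outer_class in classes:
            self.depth = 1
            self.records.append("")   # start a new record for this card

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.records[-1] += data.strip()

page = ('<div class="offer">Office A</div>'
        '<div class="offer">Office B</div>'
        '<div class="offer">Office C</div>')
splitter = OuterSelectorSplitter("offer")   # the external selector ".offer"
splitter.feed(page)
# one multidimensional page -> several records
```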

17. Export data online

You can export data online over the http or https protocol. The corresponding export settings are available in the grabber's settings. The export function and the http(s) requests are implemented in the grabbing server.

The data is transmitted to the URL you specify immediately after it has been extracted by the grabber. Specification of the transmitted data:

  • Request type: $_POST
  • Data format: JSON
  • BLOB format: Base64
  • Request fields: task-name, parsed-data, parsed-binary
  • Specification of "parsed-data" field:
    [
      {
        "PageUrl" => "[STRING] Page URL",
        "GrabDateTime" => "[STRING] Date and time when the page was grabbed",
        "PatternName" => "[STRING] Corresponding pattern name",
        "GrabbedData" => [
          [
            {
              "PatternItemName" => "[STRING] Corresponding pattern element name",
              "Data" => [
                {
                  "GrabbedData" => "[STRING] HTML or text",
                  "Attributes" => [
                    {
                      "AttrName" => "[STRING] Attribute name",
                      "AttrValue" => "[STRING] Attribute value",
                      "AttrFileName" => "[EMPTY]"
                    },
                    { ... }, ...
                  ]
                },
                { ... }, ...
              ]
            },
            { ... }, ...
          ],
          [ ... ], ...
        ]
      },
      { ... }, ...
    ]
  • Specification of "parsed-binary" field:
    [
      {
        "BinaryDataGUID" => "[STRING] GUID",
        "AttributeName" => "[STRING] src|href|data-*|...",
        "AttributeValue" => "[STRING] Some attribute value",
        "DataSize" => "[INTEGER] (Kb)",
        "DataContent" => "[STRING(Base64)] Base64-encoded string"
      },
      { ... }, ...
    ]
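On the receiving side, handling such a request amounts to decoding the posted JSON fields and restoring the Base64 BLOBs. The payload below is a minimal hand-made example shaped like the "parsed-data" / "parsed-binary" specification above, not a capture of real application output.

```python
import base64
import json

# Hypothetical POST body, shaped like the specification above.
posted = {
    "task-name": "example-task",
    "parsed-data": json.dumps([{
        "PageUrl": "https://shop.example/product/1",
        "PatternName": "Product card",
    }]),
    "parsed-binary": json.dumps([{
        "BinaryDataGUID": "0001",
        "AttributeName": "src",
        "DataContent": base64.b64encode(b"\x89PNG...").decode("ascii"),
    }]),
}

records = json.loads(posted["parsed-data"])
binaries = {
    item["BinaryDataGUID"]: base64.b64decode(item["DataContent"])
    for item in json.loads(posted["parsed-binary"])  # restore the BLOBs
}
```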