Here you will find basic information about the application and its features. In the application itself, many elements have a help icon next to them; clicking it opens an interactive hint.
- Microsoft Windows 7 or later
- .NET Framework 4.5.2 or later
- Open firewall ports 80 and 443
- 1 GB of free HDD space or more
- 1 GB of free RAM or more
- Dual-core CPU or better
- Stable internet connection
Application licensing is built around license keys. You can get a key on the website and activate it in the downloaded application. You can also get a demo key.
Note that license keys are time-bound. A key is activated at the moment of purchase, so its validity period is counted from the purchase date, not from the date you activate it in the application.
When the license key expires, scraping is blocked in the application.
Please note that the application may contact the website to verify the authenticity of your license key. Each license key also has a limit of 3 activations. If you have reached the limit but believe the restriction was imposed unfairly, please contact us and we will correct the problem.
There are two principal ways to crawl and scrape a website: .NET crawling and CEF crawling. Let’s look at the differences:
- .NET crawling (native crawling) downloads the source code of pages directly over HTTP(S), without rendering them in a browser. It is fast, but it cannot see content that is generated dynamically by JavaScript.
- CEF crawling uses the Chromium Embedded Framework, based on the Chromium web browser. This method effectively emulates working with the site through a real web browser. It can scan pages with dynamic content and interact with scanned pages via custom JS scripts. However, it is slower than native crawling because it loads a large number of libraries and waits for each page to be fully loaded and rendered.
We recommend CEF crawling as the primary method of scanning web pages. It is the more flexible way to scan pages and is suitable for most websites. You can change this setting in the “Crawler settings” section.
First of all, create a new Data Excavator Task and configure its settings. You can also keep all the default settings. Note that many settings have a help button.
You can choose between creating a standard project and an express project. An express project immediately asks you to enter the CSS selectors of the elements you want to extract from the website. If you know what CSS selectors are, there is no significant difference between an express project and an ordinary project; if you do not, it is better to use an ordinary project.
Once you have created your project, simply click the Start button on the project card.
If you are using an express project, simply enter a link to a product page and click “Automatically detect .CSS selectors”. Our application will then try to automatically detect three CSS selectors for this page and offer them to you. If it fails, you will have to specify several CSS selectors for the page yourself.
You can also use the ready-made projects library. In this case, find the site you are interested in and click on “Create a project”.
You can easily import and export project settings using the corresponding buttons in the settings section. You can also use the default settings as a starting point for a new project. We use the .json format to import and export data.
If you want to import settings, press “File” -> “Import project from file”
You import and export project settings from the project card. Select the project you want to work with and press the “Settings” button. Then use the “Import” and “Export” buttons.
You can also use the library of ready-made templates in the “Ready-made scraping templates” section, where we publish our own settings for various sites. New templates are distributed with application updates.
Our application is internally divided into a scraping module and a scanning module. The scraping module is responsible for extracting data. The scanning module is responsible for downloading pages from a website. If you want to extract data only from certain categories, you need to configure both the scanning module and the scraping module.
A. Setting up scanning module (Crawler settings)
By default, the application will analyze the entire site. This means that all pages will be downloaded and processed.
To limit the application to only the selected categories, use the settings as shown in the illustrations below.
In this example, we have configured the page-download (crawler) server. The application will analyze the site’s pages sequentially. From each page, it extracts a list of links, and it follows only those links that contain the substrings “dp/” and “n%3a16225007011%2cn%3a172456”.
At the same time, the application will NOT follow links that contain substrings “.css, .js, assets, .png, .jpg, .jpeg, .bmp, .exe, .msi, cart, register, ap/, images-na, help/, customer/, redirect, account, blog, reviews”.
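The follow/skip rules above can be sketched roughly as follows (the function and list names are illustrative, not part of the application):

```javascript
// Illustrative sketch of the crawler's link-filtering rules described above.
// followSubstrings: a link is followed only if it contains at least one of these.
// skipSubstrings: a link is never followed if it contains any of these.
const followSubstrings = ["dp/", "n%3a16225007011%2cn%3a172456"];
const skipSubstrings = [
  ".css", ".js", "assets", ".png", ".jpg", ".jpeg", ".bmp", ".exe", ".msi",
  "cart", "register", "ap/", "images-na", "help/", "customer/", "redirect",
  "account", "blog", "reviews",
];

function shouldFollow(url) {
  if (skipSubstrings.some((s) => url.includes(s))) return false; // blocked substring
  return followSubstrings.some((s) => url.includes(s));          // must match a follow rule
}
```

The skip list is checked first, so a link matching both lists is not followed.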
B. Setting up scraping module (Data to scrape)
So, we figured out how the scan server works. Now we move on to the scraping server. This is the adjacent settings tab.
The scraping server has a “URL mask” setting, where you specify substrings of the pages from which you want to extract data. To extract data from every page downloaded by the crawling server, set the value to *. If the target pages share a substring you can reliably match, it is better to use that substring; this helps you avoid many blank pages you do not need information from.
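A minimal sketch of how such a mask could be interpreted (the function is hypothetical, for illustration only):

```javascript
// Hypothetical interpretation of the "URL mask" setting:
// "*" matches every page; any other value is treated as a required substring.
function urlMatchesMask(url, mask) {
  return mask === "*" || url.includes(mask);
}
```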
Our application uses CSS selectors and XPath expressions to extract data. If you are not familiar with CSS selectors or XPath, you may find the application difficult at first; however, they are not hard to learn.
In general, data extraction works as follows. You specify a list of blocks to be extracted from the site in the “Data to scrape” section. All blocks are specified using CSS selectors or XPath expressions.
In the same section you specify the URL mask indicating which pages the data should be extracted from. For example, if all pages from which you want to extract data contain the substring “dp/” (as on Amazon.com), specify that substring. Our application will then retrieve data only from pages whose URL contains it.
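As an illustration, a pattern list for a hypothetical product page might pair block names with CSS selectors like this (the names and selectors below are examples, not settings shipped with the application):

```javascript
// Hypothetical pattern list: block names mapped to CSS selectors.
const patterns = {
  title: "#productTitle",             // product name element
  price: "span.a-price .a-offscreen", // displayed price element
  image: "#landingImage",             // main product image element
};
```

Each named block becomes one field in the scraped record for a matching page.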
You can export the collected data in different formats. There are several formats to choose from – .xlsx, .csv, .sql, .json. The most convenient and complete is the .json format – it is close in its structure to the internal file format, and allows you to present the information in the simplest form.
When exporting data, you get a file (or a set of files) with text data, plus a folder of binary data, which may include images, archives, and other files that the system was configured to download and analyze during scanning.
To export data, use the corresponding menu item in the project block.
After that, select the necessary data to export and click “Export checked results”.
After exporting, you will receive a data file and an image folder. We offer several file formats to choose from – .json / .csv / .xlsx / .mysql. The most detailed format is JSON, which contains all data in the most complete form.
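As an illustration, reading such a JSON export might look like this (the record shape and field names below are invented for the example; the real export structure may differ):

```javascript
// Hypothetical shape of one exported record; real field names may differ.
const exported = JSON.stringify([
  { url: "https://example.com/dp/B1", data: { title: "Example product", price: "$10" } },
]);

// Parse the export file's contents and walk the records.
const records = JSON.parse(exported);
for (const r of records) {
  console.log(r.url, r.data.title);
}
```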
The basic unit of the system is the project. Each project has several property groups; the main ones are the crawler settings, the grabber settings, and the pattern list. The crawler settings affect the server that downloads the source code of pages, works with the site over HTTP(S), and downloads binary files. The grabber settings affect the server that extracts data from the downloaded pages. The pattern list is a set of rules by which the grabber extracts the contents of pages.
Note – in the settings window there is a button “Set to defaults”, which resets all properties to a typical state.
The settings window also contains “Import settings” and “Export settings” buttons, which save all settings to a file or load them from a file.
You can change the settings and try to test the project. There are a lot of settings, and you will have to deal with some of them, especially when it comes to complex sites. If something goes wrong, you can always reset the settings to their default values.
In some cases, you may need a set of proxy servers to collect data from the site correctly. Sometimes the site is not available in your location, and sometimes the site actively monitors user behavior and blocks user activity by IP address. In such situations, a proxy server can be used to perform actions on your behalf. Our application supports proxy servers both for native data scanning and for CEF scanning.
You can define several proxy servers that will be used to scan the site. A distinctive feature of our application is the ability to rotate proxy servers during scanning. You can choose between several rotation modes – random rotation, sequential rotation, or no rotation.
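The rotation modes can be sketched as follows (a simplified illustration; the class and method names are not the application’s API):

```javascript
// Simplified proxy-rotation sketch: "random", "sequential", or "none".
class ProxyRotator {
  constructor(proxies, mode) {
    this.proxies = proxies;
    this.mode = mode;
    this.index = 0;
  }
  next() {
    if (this.mode === "none") return this.proxies[0]; // always the same proxy
    if (this.mode === "random")
      return this.proxies[Math.floor(Math.random() * this.proxies.length)];
    // sequential: cycle through the list in order
    const p = this.proxies[this.index];
    this.index = (this.index + 1) % this.proxies.length;
    return p;
  }
}
```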
Our application can analyze the contents of the robots.txt file. You can configure whether the rules in robots.txt should be honored. If the setting is enabled, the robots.txt file is downloaded and analyzed, and the application re-downloads and re-analyzes it every N days, according to the settings.
Note that our application uses the “Robots.txt U-a chain” setting when analyzing the robots.txt file. This parameter has a standard value of “Googlebot,Yandex,*,YandexBot”, meaning the file is checked sequentially against each User-Agent value in the list. The first matching block found is used as the set of effective rules. Thus, if the file contains a Googlebot block, only the rules from that block are applied; if there is a Yandex block but no Googlebot block, the rules from the Yandex block are applied.
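A rough sketch of this first-match logic (the parsing here is deliberately simplified; real robots.txt handling has more rules, such as groups with multiple User-agent lines):

```javascript
// Simplified first-match lookup over a robots.txt "User-agent chain".
// Returns the directive lines of the first User-agent block that matches
// an agent in the chain, scanning the chain in order.
function pickBlock(robotsTxt, chain) {
  const blocks = {};
  let current = null;
  for (const line of robotsTxt.split("\n")) {
    const m = line.match(/^User-agent:\s*(.+)$/i);
    if (m) {
      current = m[1].trim();     // start a new block for this agent
      blocks[current] = [];
    } else if (current && line.trim()) {
      blocks[current].push(line.trim()); // directive belongs to current block
    }
  }
  for (const agent of chain) {
    if (blocks[agent]) return blocks[agent]; // first chain entry with a block wins
  }
  return [];
}
```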
Sometimes you may need to execute your own JS code on the pages of websites. Moreover, in most cases it is very important for quality data scraping.
For example, if you have worked with amazon.com or aliexpress.com, you know that part of the necessary information (including all product photos) is embedded in the page source in an encoded form. To retrieve everything you need, you have to process this information with JS code on the page before scraping.
In any case, regardless of the purpose of your JS scripts, our application provides this possibility. Go to the crawling server settings and edit the corresponding setting.
In the JS handler settings you can specify a main script and a secondary script. Our application will go through all steps sequentially (Step 1 – Step 4). If the page has a complex structure (e.g. endless scrolling), use “JS script 2”, which lets you scroll the page further and further down in a loop. If you only need to execute a simple preparation script once, use “JS script 1”.
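For example, a fragment pasted into the “JS script 2” field for an endless-scrolling page might look roughly like this (a sketch for the in-browser context; the exact contract the application expects from these scripts is an assumption here, so adapt it to your settings):

```javascript
// Sketch of a browser-side scrolling step for an endless-scroll page.
// Intended as a settings fragment for the secondary ("JS script 2") handler.
var before = document.body.scrollHeight;
window.scrollTo(0, before); // jump to the current bottom of the page
setTimeout(function () {
  var after = document.body.scrollHeight;
  // If the height grew, more content was loaded and another pass is useful.
  console.log(after > before ? "more content loaded" : "reached the end");
}, 2000);
```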
In some cases, you may need to log in with your username and password before you start scraping. Our application provides this option. Open the Crawler settings and use the appropriate setting.
In this window you specify the settings for authentication, including the JS script responsible for logging in. In it, using CSS selectors, you access the “login” and “password” fields and press the “login” button.
The “is logged on check” field contains a string whose presence or absence determines whether you are currently working under your account. If, after loading the next page, the application does not find this string in the page source, it will try to log in again.
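A hypothetical authentication script of this kind might look as follows (the selectors and placeholder credentials are examples only; use the selectors that match your target site’s login form):

```javascript
// Sketch of a browser-side login script driven by CSS selectors.
// The selectors below are placeholders for the real form's elements.
document.querySelector("input#login").value = "your-username";
document.querySelector("input#password").value = "your-password";
document.querySelector("button#login-button").click();
```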
Our system allows extracting data from iframe blocks. In order to use this feature, set the appropriate setting. After that, address the iframe blocks with selectors as if they were ordinary page elements.
You can export data online over HTTP or HTTPS. The corresponding export settings are available in the grabber settings; the export function and HTTP(S) requests are implemented in the grabbing server.
The data is sent as a POST request (available in PHP via the $_POST array). The information itself is represented as a JSON array; images are included as base64 strings.
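On the receiving side, decoding such a payload might look roughly like this (a sketch; the field names `data`, `url`, and `image` are invented for the example and are not the application’s actual POST schema):

```javascript
// Hypothetical receiver-side handling of an export payload:
// the JSON array arrives as a POST field, images as base64 strings.
const payload = JSON.stringify([
  { url: "https://example.com/dp/B1", image: Buffer.from("fake-bytes").toString("base64") },
]);

const items = JSON.parse(payload);
const imageBytes = Buffer.from(items[0].image, "base64"); // decode back to binary
console.log(imageBytes.length, "bytes decoded");
```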
Unfortunately, our application does not have this feature yet. If a website uses CAPTCHA, you cannot extract data from it using DataExcavator. We are currently working on implementing this feature and plan to finish CAPTCHA processing in the near future.
If you bought a license key on the “Standard” or “Enterprise” plan, write to us via the feedback form – we will create the site settings for you.
If you are using a demo key or a promotional key – unfortunately, we cannot create settings free of charge. If you would like us to create settings on a paid basis, please contact us and we will quote you a price for these services.