ExcavatorSharp - C# web data scraping library

ExcavatorSharp technical documentation

Docs v1.0, dated 22.01.2020

Please note that the NuGet x64 build is intended to run strictly on x64 platforms. If you need an x86 build, download the x86 library directly from our website. The build is not cross-platform (AnyCPU) due to the peculiarities of the Chromium Embedded Framework.

After installing ExcavatorSharp, pay attention to the CEF-related packages. The following packages are required for correct operation: cef.redist.x86, cef.redist.x64, CefSharp.Common, CefSharp.OffScreen. Important! To compile your program correctly, you must choose between an x86 and an x64 build. AnyCPU builds are not supported because the CEF build architecture requires an explicit application architecture: x86 or x64. A sketch of the corresponding project settings follows.
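To illustrate, here is a minimal, hypothetical .csproj excerpt (our sketch, not shipped with the library) using PackageReference-style references with an explicit x64 target. An x86 build would reference cef.redist.x86 and set PlatformTarget to x86 instead; the floating versions are placeholders for whatever NuGet resolves:

<!-- Explicit platform target: AnyCPU is not supported by CEF -->
<PropertyGroup>
    <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
<ItemGroup>
    <!-- Floating versions ("*") are illustrative placeholders -->
    <PackageReference Include="cef.redist.x64" Version="*" />
    <PackageReference Include="CefSharp.Common" Version="*" />
    <PackageReference Include="CefSharp.OffScreen" Version="*" />
</ItemGroup>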

After installing all the libraries, we recommend restarting Visual Studio, and this is not a joke. CEF builds do not always take effect immediately after package installation and often do not work right after being installed via NuGet. To complete the installation correctly, you often need to restart Visual Studio and rebuild the project.

1. System requirements

The library has the following dependencies and requirements:

2. Library licensing

By default, the library is provided with a free key that limits you to 10 projects, with 2 threads per project. To increase the number of projects or threads, you need to purchase a commercial license.

Note that the first time you initialize a license key, the library contacts http://data-excavator.com/ to validate the key. Since the library is a data scraper intended to extract data from web pages, we believe this is fully acceptable behavior to protect our interests and our copyright in the library. Please note that the library does not transmit any data to us except the key you initialized it with.

To initialize the library, use the standard demo key or a key received from us. The standard key:

Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs

3. Basic usage example

The basic unit of data scraping is the DataExcavatorTask class, an isolated scraping task. Each task contains an isolated crawling server, an isolated parsing server, and its own buffers. Your project may have several tasks, each working in parallel with its own set of threads. Tasks can be started and stopped independently of each other.

The DataExcavatorTasksFactory class stores tasks and is also responsible for exporting data from a task. Before writing any code related to data scraping, you must initialize the Chromium Embedded Framework components, and before closing the program you must deinitialize them; otherwise, data parsing through CEF will not work correctly. CEF is initialized by calling CEFSharpFactory.InitializeCEFBrowser() and deinitialized by calling CEFSharpFactory.ShutdownCEF(). Ideally, CEFSharpFactory.ShutdownCEF() should be called just before closing the program, as sketched below.
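A minimal sketch of this lifecycle (the try/finally wrapper is our suggestion for guaranteeing the shutdown call runs, not a library requirement):

public static void Main()
{
    //Initialize CEF components once, before any scraping code runs
    CEFSharpFactory.InitializeCEFBrowser();
    try
    {
        //... create tasks, scrape and export data ...
    }
    finally
    {
        //Deinitialize CEF just before the program closes
        CEFSharpFactory.ShutdownCEF();
    }
}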

To start a scraping task, call the DataExcavatorTask.StartTask() method, which accepts an Action delegate that is invoked after the task has been started successfully.
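For example (an illustrative call; the delegate appears to be optional, since the examples below also call StartTask() without one):

NewTask.StartTask(() => Console.WriteLine("Task started")); //invoked after the task starts successfully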

public class ESHelloWorld
{
    public void ParseWalmartAnyPage()
    {
        //1. Initialize tasks storage
        CEFSharpFactory.InitializeCEFBrowser();
        DataExcavatorTasksFactory TasksFactory = new DataExcavatorTasksFactory();
        TasksFactory.InitializeExcavator("Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs");
        //2. Create your first task
        DataExcavatorTask NewTask = new DataExcavatorTask(
            "My first scraping task",
            "https://www.walmart.com/",
            "Scrape some data from walmart",
            new List<DataGrabbingPattern>() {
                new DataGrabbingPattern() {
                    PatternName = "Apple ipads pattern",
                    AllowedPageUrlsSubstrings = new string[] { "ip/" },
                    GrabbingItemsPatterns = new List<DataGrabbingPatternItem>() {
                        new DataGrabbingPatternItem("Item name", new GrabberSelector("h1.prod-ProductTitle", DataGrabbingSelectorType.CSS_Selector)),
                        new DataGrabbingPatternItem("Item price", new GrabberSelector(".prod-PriceSection .display-inline-block-m .price-characteristic", DataGrabbingSelectorType.CSS_Selector))
                    }
                }
            },
            new CrawlingServerProperties() { RespectRobotsTxtFile = false, RespectOnlySpecifiedUrls = new string[] { "ip/" }, PrimaryDataCrawlingWay = DataCrawlingType.NativeCrawling },
            new GrabbingServerProperties(),
            "C:/ExcavatorSharp/FirstProject"
        );
        //3. Subscribe events
        NewTask.PageCrawled += NewTask_PageCrawled;
        NewTask.PageGrabbed += NewTask_PageGrabbed;
        //4. Add links to crawling and start task
        TasksFactory.AddTask(NewTask);
        NewTask.AddLinksToCrawling(new List<string>() { "https://www.walmart.com/ip/Apple-10-2-inch-iPad-7th-Gen-Wi-Fi-32GB/216119597" });
        NewTask.StartTask();
    }

    /// <summary>Fires when some data from the website has been extracted</summary>
    private void NewTask_PageGrabbed(PageGrabbedCallback GrabbedPageData)
    {
        Dictionary<DataGrabbingPattern, DataGrabbingResult> GrabbedResults = GrabbedPageData.GrabbingResults;
        //Do what you want with the grabbed data
    }

    /// <summary>Fires when some page from the website has been downloaded and parsed into structured HTML</summary>
    private void NewTask_PageCrawled(PageCrawledCallback CrawledPageData)
    {
        //... just take the crawled data, or skip this event. Additionally, you can prevent page grabbing if you want (CrawledPageData.PreventPageGrabbing = true;)
    }
}

4. Events model

The library uses events to deliver scraping results and to walk a site's pages. There are two basic events that you may need when using the library.

The DataExcavatorTask.PageCrawled event occurs when a page has been downloaded from the site and converted into a structured HTML object.

The DataExcavatorTask.PageGrabbed event occurs when a page has been parsed and data has been extracted (successfully or unsuccessfully).

There is also an auxiliary event, DataExcavatorTask.LogMessageAdded, which occurs every time some significant behavior is logged by the system. In general, you can observe how the system works through the LogMessageAdded log entries.

Note that in the basic case you may not even need the event model. By default, all data scraped from websites is saved to files and folders on your hard drive, and you can later export this data through the appropriate methods (see section 7). The event model allows deeper integration with our library, but using events is not mandatory.

/// <summary>Subscribe system events</summary>
private void SubscribeEvents()
{
    NewTask.PageCrawled += NewTask_PageCrawled;
    NewTask.PageGrabbed += NewTask_PageGrabbed;
    NewTask.LogMessageAdded += NewTask_LogMessageAdded;
}

/// <summary>Fires when some event has been logged</summary>
/// <param name="Callback"></param>
private void NewTask_LogMessageAdded(DataExcavatorTaskEventCallback Callback)
{
    //Observe logged data
}

/// <summary>Fires when some data from the website has been extracted</summary>
private void NewTask_PageGrabbed(PageGrabbedCallback GrabbedPageData)
{
    Dictionary<DataGrabbingPattern, DataGrabbingResult> GrabbedResults = GrabbedPageData.GrabbingResults;
    //Do what you want with the grabbed data
}

/// <summary>Fires when some page from the website has been downloaded and parsed into structured HTML</summary>
private void NewTask_PageCrawled(PageCrawledCallback CrawledPageData)
{
    //... just take the crawled data, or skip this event. Additionally, you can prevent page grabbing if you want (CrawledPageData.PreventPageGrabbing = true;)
}

5. Data storing model

ExcavatorSharp stores some data on your hard drive. Data is saved in various folders and files under the parent directory defined by the DataExcavatorTask.TaskOperatingDirectory parameter. Let us examine these parameters in more detail.

Note that by default these variables are set to recommended values. Under this model, the meta-data of downloaded pages, as well as the page parsing results, is stored on the hard disk. By default, original copies of web pages are NOT stored on the hard disk. This model allows you to export the data some time after parsing has started.

6. Links management

Our library allows you to interact flexibly with scanned hyperlinks. The crawling server (CrawlingServer) is responsible for link processing. Links for the site being processed are stored in the CrawlingServer.CrawlingServerLinksBuffer object. This object is internal to the ExcavatorSharp build and is not available for external use. However, the data crawling project (DataExcavatorTask) has several methods for managing links directly.

Note that links will only be processed when CrawlingServerProperties.CrawlWebsiteLinks is set to true. For periodic re-crawling and re-scraping of data, use the CrawlingServerProperties.ReindexCrawledPages and CrawlingServerProperties.ReindexCrawledPagesAfterSpecifiedMinutesInterval parameters.

Take a look at the following example, which covers adding links for indexing, removing links, and setting up page re-indexing.

//1. Initialize tasks storage 
CEFSharpFactory.InitializeCEFBrowser();
DataExcavatorTasksFactory TasksFactory = new DataExcavatorTasksFactory();
TasksFactory.InitializeExcavator("Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs");
//2. Create some task
CrawlingServerProperties CrawlerProps = new CrawlingServerProperties();
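CrawlerProps.CrawlWebsiteLinks = true; //Links are only processed when this flag is set (see the note above)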
CrawlerProps.ReindexCrawledPages = true; 
CrawlerProps.ReindexCrawledPagesAfterSpecifiedMinutesInterval = 10; // Set all pages recrawl time to 10 minutes
CrawlerProps.LinksBufferHDDAutoSavingMilliseconds = 5000; // Flush links buffer to HDD every 5 seconds
DataExcavatorTask NewTask = new DataExcavatorTask(
    "Task1",
    "https://www.walmart.com",
    "Some description",
    new List<DataGrabbingPattern>(),
    CrawlerProps,
    new GrabbingServerProperties(),
    "C:/Walmart1"
);
//3. Add links to crawling and start task 
NewTask.AddLinksToCrawling(new List<string>() { "https://www.walmart.com/ip/Apple-10-2-inch-iPad-7th-Gen-Wi-Fi-32GB/216119597" });
//4. Remove links from crawling
NewTask.DeleteLinksFromCrawling(new List<string>() { "http://www.walmart.com/io/Apple-iphone-xs" });
//5. Force link recrawling
NewTask.ForceLinkRecrawling("https://www.walmart.com/ip/Apple-10-2-inch-iPad-7th-Gen-Wi-Fi-64-gb");

7. Exporting scraped data

As we have already noted, by default the program does not require you to use the event model, deep integration, or a deep understanding of the algorithms. By default, the library simply runs with a set of parameters, then starts retrieving data and saving it to your hard drive. You can then export this data in a convenient format at any time.

The following example shows how to export data without using the event model.

/// <summary>
/// Data export testing
/// </summary>
public class ESExportExample
{
    public void ParseAndExportWalmartAnyPage()
    {
        //1. Initialize tasks storage
        CEFSharpFactory.InitializeCEFBrowser();
        DataExcavatorTasksFactory TasksFactory = new DataExcavatorTasksFactory();
        TasksFactory.InitializeExcavator("Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs");
        //2. Create your first task
        DataExcavatorTask NewTask = new DataExcavatorTask(
            "My first scraping task",
            "https://www.walmart.com/",
            "Scrape some data from walmart",
            new List<DataGrabbingPattern>() {
                new DataGrabbingPattern() {
                    PatternName = "Apple ipads pattern",
                    AllowedPageUrlsSubstrings = new string[] { "ip/" },
                    GrabbingItemsPatterns = new List<DataGrabbingPatternItem>() {
                        new DataGrabbingPatternItem("Item name", new GrabberSelector("h1.prod-ProductTitle", DataGrabbingSelectorType.CSS_Selector)),
                        new DataGrabbingPatternItem("Item price", new GrabberSelector(".prod-PriceSection .display-inline-block-m .price-characteristic", DataGrabbingSelectorType.CSS_Selector))
                    }
                }
            },
            new CrawlingServerProperties() { RespectRobotsTxtFile = false, RespectOnlySpecifiedUrls = new string[] { "ip/" }, PrimaryDataCrawlingWay = DataCrawlingType.NativeCrawling },
            new GrabbingServerProperties(),
            "C:/ExcavatorSharp/FirstProject"
        );
        //3. Add links to crawling and start task
        TasksFactory.AddTask(NewTask);
        NewTask.AddLinksToCrawling(new List<string>() { "https://www.walmart.com/ip/Apple-10-2-inch-iPad-7th-Gen-Wi-Fi-32GB/216119597" });
        NewTask.StartTask();
        //... Wait some time - 30 seconds, for example (requires System.Threading) ...
        Thread.Sleep(30000);
        //Export by date period
        NewTask.ExportAllGrabbedData(
            "C:/ExportPath",
            DataExportingFormat.XLSX,
            DataExportingType.OuterHTML,
            ",",
            new DateTime(2019, 10, 10),
            DateTime.Now
        );
        //Export selected metadata entries. The metadata files are generated by the GrabbingServer
        List<GrabbedPageMetaInformationDataEntry> AllParsedDataSavedOnHDD = NewTask.GetGrabbedDataListOverview();
        NewTask.ExportSelectedGrabbedData(
            "C:/ExportPath2",
            DataExportingFormat.JSON,
            DataExportingType.InnerText,
            ",",
            AllParsedDataSavedOnHDD
        );
    }
}

What if you want to export the data online? Our library supports this: there is a corresponding setting, and when it is enabled, data is sent immediately as it appears in the parsing results buffer.

/// <summary>
/// Data export online
/// </summary>
public class DataExportingOnlineTest
{
    public void ParseAndExportWalmartAnyPage()
    {
        //1. Initialize tasks storage
        CEFSharpFactory.InitializeCEFBrowser();
        DataExcavatorTasksFactory TasksFactory = new DataExcavatorTasksFactory();
        TasksFactory.InitializeExcavator("Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs");
        //2. Create some task
        GrabbingServerProperties GrabberProps = new GrabbingServerProperties();
        GrabberProps.ExportDataOnline = true;
        GrabberProps.ExportDataOnlineInvokationLink = "https://your-website/take-parsing-results.php";
        DataExcavatorTask NewTask = new DataExcavatorTask(
            "Task1",
            "https://www.walmart.com",
            "Some description",
            new List<DataGrabbingPattern>(),
            new CrawlingServerProperties(),
            GrabberProps,
            "C:/Walmart1"
        );
    }
}

8. Scraping images and binary content
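The library can also download binary content, such as images, referenced by page elements. The example below (described by its own comments) creates a pattern item that targets img tags and downloads the data referenced by their src attribute to a file.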

/// <summary>
/// Extract data from img tags
/// </summary>
class ImagesScraping
{
    public void ScrapeImages()
    {
        //1. Initialize tasks storage
        CEFSharpFactory.InitializeCEFBrowser();
        DataExcavatorTasksFactory TasksFactory = new DataExcavatorTasksFactory();
        TasksFactory.InitializeExcavator("Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs");
        //2. Create some pattern
        DataGrabbingPattern NewPattern = new DataGrabbingPattern();
        NewPattern.AllowedPageUrlsSubstrings = new string[] { "page1" };
        NewPattern.PatternName = "Grab images from item page";
        NewPattern.GrabbingItemsPatterns = new List<DataGrabbingPatternItem>();
        //3. Target an image attribute - download data from the src attribute and save it to a file
        DataGrabbingPatternItem PatternItem = new DataGrabbingPatternItem(
            "Image tag",
            new GrabberSelector("img.item-image", DataGrabbingSelectorType.CSS_Selector),
            true,
            new ParsingBinaryAttributePattern[] {
                new ParsingBinaryAttributePattern("src", true, true)
            }
        );
        //Attach the item to the pattern so a task can use it (the original example omitted this step)
        NewPattern.GrabbingItemsPatterns.Add(PatternItem);
    }
}

9. Proxies support

Our library supports proxies. You can define a proxy pool to be used for data scanning; each time the current page finishes downloading, the crawling server picks a proxy from this pool. Which proxy is chosen depends on the rotation algorithm: you can use sequential rotation or random rotation.

/// <summary>
/// Proxies testing
/// </summary>
class ProxiesSupport
{
    public void TestProxy()
    {
        //1. Initialize tasks storage
        CEFSharpFactory.InitializeCEFBrowser();
        DataExcavatorTasksFactory TasksFactory = new DataExcavatorTasksFactory();
        TasksFactory.InitializeExcavator("Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs");
        //2. Define proxy list
        CrawlingServerProperties CrawlerProps = new CrawlingServerProperties();
        CrawlerProps.ProxiesRotation = ProxiesRotationType.SequenciveRotation;
        CrawlerProps.HTTPWebRequestProxiesList = new List<DataCrawlingWebProxy>();
        CrawlerProps.HTTPWebRequestProxiesList.Add(new DataCrawlingWebProxy("127.0.0.1", 601));
        CrawlerProps.HTTPWebRequestProxiesList.Add(new DataCrawlingWebProxy("192.168.0.101", 450, true, "UserName", "UserPassword"));
        //3. Create task
        DataExcavatorTask NewTask = new DataExcavatorTask(
            "Task1",
            "https://www.walmart.com",
            "Some description",
            new List<DataGrabbingPattern>(),
            CrawlerProps,
            new GrabbingServerProperties(),
            "C:/Walmart1"
        );
    }
}

10. Crawling ways

There are two principal ways of crawling data. The first is native crawling using standard .NET tools (Sockets). The second is dynamic crawling using the Chromium Embedded Framework. In the first case, content downloads faster, but you cannot download dynamic content or interact with the site through JavaScript. In the second case, you get a full web browser at your disposal, but scanning sites is slower.

/// <summary>
/// Setup crawling way
/// </summary>
class CrawlingWays
{
    public void SetupCrawlingWay()
    {
        //1. Initialize tasks storage
        CEFSharpFactory.InitializeCEFBrowser();
        DataExcavatorTasksFactory TasksFactory = new DataExcavatorTasksFactory();
        TasksFactory.InitializeExcavator("Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs");
        //2. Define primary crawling way
        CrawlingServerProperties CrawlerProps = new CrawlingServerProperties();
        //Use .NET Sockets to download pages (dynamic content not supported, JS interaction not supported, faster crawling)
        CrawlerProps.PrimaryDataCrawlingWay = DataCrawlingType.NativeCrawling;
        //Use Chromium Embedded Framework to download pages (dynamic content SUPPORTED, JS interaction SUPPORTED, slower crawling)
        //Note: this second assignment overrides the first; pick one of the two
        CrawlerProps.PrimaryDataCrawlingWay = DataCrawlingType.CEFCrawling;
        //3. Create task
        DataExcavatorTask NewTask = new DataExcavatorTask(
            "Task1",
            "https://www.walmart.com",
            "Some description",
            new List<DataGrabbingPattern>(),
            CrawlerProps,
            new GrabbingServerProperties(),
            "C:/Walmart1"
        );
    }
}

11. Interact with websites using JS

Using our library, you can interact with sites through JavaScript. Please note that the interaction is not linear (as it is in Selenium): you cannot perform an arbitrarily long sequence of actions on a page. Our library is closer to a search engine, and its main task is to quickly scan and parse data from a site's pages. If your task is to narrowly parse a small number of pages, for example “go to the site, enter your login and password, go to the first page, set a filter, go to another page, set another filter, click on the datepicker, then go to the third page, click on the download”, then our solution will not be very convenient for you. In other cases, our library is quite applicable, convenient, and effective.

The CEFCrawlingBehaviors collection is used for interacting with a site's pages. Each element of this collection defines one interaction between our library and the target site.

/// <summary>
/// Behaviors testing
/// </summary>
public void TestCEFBehaviors()
{
    //1. Initialize tasks storage
    CEFSharpFactory.InitializeCEFBrowser();
    DataExcavatorTasksFactory TasksFactory = new DataExcavatorTasksFactory();
    TasksFactory.InitializeExcavator("Vbk4eQWp8kdmqnl2QlzWkBWIQzu++xD6yEwYB68SiEFVSOyRL0fEB0T7XlheB93/rRWdFtsnoHeiUu0WcVYHqqCZzPq0s0APf6KkND8B3N6ZL0yZ+vQsvvTCFf+SYADDMcW1RQLKHh+r03w+BOulu6nCM0sSHGDtqSiUGpjSa1RA4DLdSnGW7pTbXuqI2CGs");
    //2. Define primary crawling way
    CrawlingServerProperties CrawlerProps = new CrawlingServerProperties();
    //Use Chromium Embedded Framework to download pages (dynamic content SUPPORTED, JS interaction SUPPORTED, slower crawling)
    CrawlerProps.PrimaryDataCrawlingWay = DataCrawlingType.CEFCrawling;
    //Define a set of CEF behaviors
    CrawlerProps.CEFCrawlingBehaviors = new List<CEFCrawlingBehavior>();
    CrawlerProps.CEFCrawlingBehaviors.Add(new CEFCrawlingBehavior(
        "/item/",
        10,
        "$(function() { $('show-more-data').click(); });",
        20,
        CEFCrawlingPageLeaveEventType.LeavePageAfterJSEventReturnsSomeResult,
        "function a() { if ($('page-new-links').length == 0) return 'PAGE_INDEXED'; } a();",
        "PAGE_INDEXED"
    ));
    /*
    1. Navigate to a page whose URL contains /item/
    2. Wait 10 seconds
    3. Execute "$(function() { $('show-more-data').click(); });" on the page
    4. Wait 20 seconds
    5. GET THE PAGE'S ACTUAL RENDERED HTML CODE
    6. Execute "function a() { if ($('page-new-links').length == 0) return 'PAGE_INDEXED'; } a();" on the page
    7. Check the result of the a() function. If the result is PAGE_INDEXED -> stop crawling the page. Otherwise, go to step 2
    */
    //3. Create task
    DataExcavatorTask NewTask = new DataExcavatorTask(
        "Task1",
        "https://www.walmart.com",
        "Some description",
        new List<DataGrabbingPattern>(),
        CrawlerProps,
        new GrabbingServerProperties(),
        "C:/Walmart1"
    );
}