When you start extracting data from a website, you first need to specify a list of CSS or XPath selectors. This article covers CSS selectors and how to define them.
A CSS selector is a pattern that identifies an HTML element or a set of HTML elements on a page. To use DataExcavator successfully, you need to be able to write CSS selectors correctly. Correct selectors let you extract the content of exactly the elements you need. If a selector is written incorrectly, the program will return an empty value or the content of elements other than the ones you wanted.
Let's look at the HTML content of a page. Any page consists of a set of HTML tags. Each tag can have a number of attributes (id, class, style, data-*, src, aria-*, and so on), or no attributes at all. CSS selectors let you address an element through its attributes or through its tag name.
- For example, suppose the page contains the element <div id="itemname">Some cool notebook</div>. The CSS selector for this element is #itemname.
- Another example: the page contains the element <div class="item-price">17$</div>. The CSS selector for this element is .item-price.
- A more complicated example: the page contains <span data-type="description">This is a cool thing for you</span>. The CSS selector for this element is [data-type="description"].
Note that the form of the selector depends on the kind of attribute we refer to:
- To address an element by its id, write #SomeIDName
- To address an element by its class, write .SomeClassName
- To address an element by any other attribute (data-*, aria-*, and so on), write [SomeAttrName="SomeAttrValue"] to match a specific value, or [SomeAttrName] to match any element that has the attribute
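The three rules above can be sketched as a small helper. This is only an illustration in Python: build_selector and its parameters are hypothetical names, not part of DataExcavator, and the sketch assumes that names and values contain no characters that would need escaping in CSS.

```python
def build_selector(tag="", element_id="", classes=(), attrs=None):
    """Compose a CSS selector from a tag name, an id, classes, and attributes."""
    parts = [tag]                                   # a tag name, if any, goes first
    if element_id:
        parts.append("#" + element_id)              # id        -> #SomeIDName
    parts.extend("." + name for name in classes)    # class     -> .SomeClassName
    for name, value in (attrs or {}).items():
        if value is None:
            parts.append(f"[{name}]")               # presence  -> [SomeAttrName]
        else:
            parts.append(f'[{name}="{value}"]')     # value     -> [SomeAttrName="SomeAttrValue"]
    return "".join(parts)


print(build_selector(element_id="itemname"))        # #itemname
print(build_selector(classes=["item-price"]))       # .item-price
print(build_selector(tag="span", attrs={"data-type": "description"}))
# span[data-type="description"]
```

Passing None as an attribute value produces the presence-only form, for example build_selector(attrs={"href": None}) gives [href].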
Now let's consider more complex structures: addressing an element by several attributes at once. If we want to find an element by several attributes, they must be written one after another in the CSS selector with no spaces between them, for example: .selector1.selector2.selector3[…]. So, to select the element <div class="item-name bold green-background">Cool mobile phone</div>, we can use this selector: .item-name.bold.green-background.
The same works with other attributes. For example, <div id="lol-kek" class="item-name bold green-background">Lol phone</div> can be selected with .item-name.bold.green-background#lol-kek. Note that the order of the id, class, and attribute parts is not important: #lol-kek.class1 is equivalent to .class1#lol-kek. Only a tag name, if you use one, must come first.
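To make the "all parts must match, in any order" rule concrete, here is a minimal sketch of a matcher for such combined selectors. It is written in Python purely for illustration: matches is a hypothetical helper, it understands only #id, .class, [attr] and [attr="value"] parts, and it is not how DataExcavator or a browser actually evaluates selectors.

```python
import re

# One selector part: #id, .class, [attr] or [attr="value"] (quotes optional).
PART = re.compile(r'#([\w-]+)|\.([\w-]+)|\[([\w-]+)(?:=["\']?([^"\'\]]*)["\']?)?\]')


def matches(selector, attributes):
    """Return True if an element with these attributes satisfies every part."""
    classes = attributes.get("class", "").split()
    consumed = 0
    for part in PART.finditer(selector):
        element_id, class_name, attr_name, attr_value = part.groups()
        consumed += len(part.group(0))
        if element_id is not None and attributes.get("id") != element_id:
            return False
        if class_name is not None and class_name not in classes:
            return False
        if attr_name is not None:
            if attr_name not in attributes:
                return False
            if attr_value is not None and attributes[attr_name] != attr_value:
                return False
    if not selector or consumed != len(selector):
        # Something in the selector is outside this sketch's tiny grammar.
        raise ValueError(f"unsupported selector: {selector!r}")
    return True


phone = {"id": "lol-kek", "class": "item-name bold green-background"}
print(matches(".item-name.bold.green-background#lol-kek", phone))  # True
print(matches("#lol-kek.bold", phone), matches(".bold#lol-kek", phone))  # True True
print(matches(".item-name.red", phone))  # False
```

Each part is checked independently of the others, which is why swapping their order does not change the result.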
Now let's look at the selector cascade. It often happens that an element's set of attributes is repeated within one page and is not unique, so we cannot rely on a simple #attributeId or .attributeClassName selector. In such cases we look at the parent elements of the element we are interested in and at the attributes those parents have. This lets us build a cascade selector, which first addresses the parent and then its descendants, and returns a result only if the whole chain is found in that hierarchy. In a cascade selector, the rules are separated by spaces. For example, the selector .element1 .element2 finds every element with the class element2 that is nested, at any depth, inside an element with the class element1.
A few typical examples, each with several alternative selectors:
- Select all links on the page (<a href="…">…</a>): a, [href], a[href]. The last form is the most precise: a also matches anchors without href, and [href] also matches other tags that have this attribute, such as <link>.
- Select all images on the page (<img src="…">): img, img[src]
- Select the element with the id ItemName (<div id="ItemName">…</div>): #ItemName, div#ItemName, [id='ItemName'], div[id='ItemName']
- Select all elements with the class Class1 (<div class="Class1">…</div>): div.Class1, .Class1, div[class='Class1'], [class='Class1']. Note that the [class='…'] forms match only when the class attribute is exactly Class1, with no other classes.
- Select all elements with the class Class1 that are descendants of an element with the class ClassParent (<div class="ClassParent"><div class="Class1">…</div></div>): .ClassParent .Class1, div.ClassParent div.Class1, div[class='ClassParent'] div[class='Class1'], [class='ClassParent'] [class='Class1']
- Select all elements with the class Class1 that are direct children of an element with the class ClassParent (<div class="ClassParent"><div class="Class1">…</div></div>): .ClassParent > .Class1, div.ClassParent > div.Class1, div[class='ClassParent'] > div[class='Class1'], [class='ClassParent'] > [class='Class1']
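The difference between the last two examples (a space versus >) can be seen in a small sketch. It is a toy written in Python for illustration only: the PAGE markup and the select helper are made up for this example, the helper understands only two class names joined by a space or by >, and it relies on the snippet being well-formed XML.

```python
import xml.etree.ElementTree as ET

PAGE = """
<body>
  <div class="ClassParent">
    <div class="Class1">direct child</div>
    <section><div class="Class1">nested deeper</div></section>
  </div>
  <div class="Class1">outside the parent</div>
</body>
"""


def has_class(element, class_name):
    return class_name in element.get("class", "").split()


def select(root, selector):
    """Support only '.a .b' (any descendant) and '.a > .b' (direct child)."""
    tokens = selector.split()
    direct = ">" in tokens
    ancestor_class, target_class = (t.lstrip(".") for t in tokens if t != ">")
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = []
    for element in root.iter():
        if not has_class(element, target_class):
            continue
        ancestor = parent_of.get(element)
        while ancestor is not None:
            if has_class(ancestor, ancestor_class):
                found.append(element.text)
                break
            if direct:          # '>' looks at the immediate parent only
                break
            ancestor = parent_of.get(ancestor)
    return found


root = ET.fromstring(PAGE.strip())
print(select(root, ".ClassParent .Class1"))    # ['direct child', 'nested deeper']
print(select(root, ".ClassParent > .Class1"))  # ['direct child']
```

The element that sits outside ClassParent is never returned, which is the point of using a cascade when a class name is repeated across the page.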
The easiest way to test CSS selectors on many sites is to use the jQuery library, which those sites already load. Use the web browser console for this. First open the console and enter typeof jQuery !== "undefined" && $ === jQuery. If the console returns true, jQuery is available on this site and you can use it to check selectors. To check a selector, enter $("SELECTOR") and inspect the elements it returns. If jQuery is not available, document.querySelectorAll("SELECTOR") does the same job in any modern browser.