A CSS selector is an expression used to address nodes in an HTML page. With a CSS selector, you can easily point a program at a specific block on the page.
Many data scraping applications, including ours, rely heavily on CSS selectors. Once you know how selectors work, you can quickly configure many such programs.
When we talk about selectors, we mostly mean the CSS classes that page elements have. As we know, a page consists not only of HTML code but also of styles that give its elements their look. Styles are written in a dedicated language, CSS, and that language is tied to the page through CSS classes and attributes.
Almost all sites use CSS technology. We chose it as the basis of our scraper because it is versatile enough, it is known to many professionals and it is easy enough to understand and apply.
In general, a CSS selector can refer to any attribute of an element in the page's HTML code: id, class, data-name, data-*, src, href, and anything else. Of course, from a formal point of view, HTML evangelists would stone us for this 😬 (strictly speaking, it should be explained differently), but this explanation is much easier for beginners to understand 🤷‍♂️
This is what the code of a page element that has CSS classes can look like, for example: <h1 id="page-title" class="page-title">Product name</h1>
What does the CSS selector look like?
Please remember the basic rules for writing selectors. These rules are universal: they apply not only to our application but to any other case where you need CSS selectors.
To access an element by class name, use a dot (.) followed by the class name. For example: .page-title, .page, .itemblock, .price and so on. That is, to access an element with class="abc", use the .abc selector.
To access an element by identifier, use the hash character (#) followed by the id. For example: #page-title, #item-price, #description, #item-images. That is, if there is an element with id="fgh", you can address it like this: #fgh
To access an element by an attribute, use square brackets: [ ]. For example: [id="item"], [aria-label="foo1"] and so on. That is, to access an element with data-position="low-block", use the [data-position="low-block"] selector.
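The three rules above can be sketched as small helper functions. This is only an illustration: the names byClass, byId and byAttribute are hypothetical and are not part of any library or of our application. The sketch also assumes simple names made of letters, digits, hyphens and underscores; names with other characters would need extra escaping.

```javascript
// Hypothetical helpers that turn a class name, an id, or an attribute
// into the matching CSS selector string.

// .abc selects elements with class="abc"
function byClass(className) {
  return '.' + className;
}

// #fgh selects the element with id="fgh"
function byId(id) {
  return '#' + id;
}

// [data-position="low-block"] selects elements with that exact attribute value
function byAttribute(name, value) {
  // Escape backslashes and double quotes so the value stays inside the quotes.
  const escaped = String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return '[' + name + '="' + escaped + '"]';
}

console.log(byClass('abc'));                            // .abc
console.log(byId('fgh'));                               // #fgh
console.log(byAttribute('data-position', 'low-block')); // [data-position="low-block"]
```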
How do I pick a CSS selector?
First of all, you need to inspect the page's source code. If you're using Google Chrome, open "Developer tools" and go to the "Elements" tab. On Windows or Linux, the hotkey Ctrl + Shift + C opens the "Elements" tab directly, while Ctrl + Shift + J opens Developer tools on the "Console" tab. Here is a full article about this action: https://developers.google.com/web/tools/chrome-devtools/open
Next, find the item you want to point our program at. It can be a picture, the title of a block or a piece of text. Whatever it is, you can very likely describe it with selectors.
You can then create a CSS selector that points directly at the element with a simple expression (e.g. #page_title) or with a cascade of expressions (e.g. #page #page_title). Please note that selectors depend on the site and the page, and may differ completely from one to the next. There is no universal set of selectors that would pick out all the necessary elements on every page by default.
For example, on Amazon, the page header is an h1 tag that has the classes "a-size-large product-title-word-break" and the identifier "productTitle". Thus, for a product page on amazon.com, the selector can look like any of the following: h1, h1#productTitle, #productTitle, .product-title-word-break and so on.
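To make the example concrete, here is a minimal sketch that lists candidate selectors for an element from its tag, id and classes. The function candidateSelectors and the shape of its input object are assumptions made for this illustration; the tag, id and class values are the ones quoted above.

```javascript
// Hypothetical helper: lists several selectors that could address one element,
// given its tag name, id and class list.
function candidateSelectors({ tag, id, classes = [] }) {
  const result = [];
  if (tag) result.push(tag);                  // e.g. h1
  if (tag && id) result.push(tag + '#' + id); // e.g. h1#productTitle
  if (id) result.push('#' + id);              // e.g. #productTitle
  for (const cls of classes) {
    result.push('.' + cls);                   // e.g. .product-title-word-break
    if (tag) result.push(tag + '.' + cls);    // e.g. h1.product-title-word-break
  }
  return result;
}

const title = {
  tag: 'h1',
  id: 'productTitle',
  classes: ['a-size-large', 'product-title-word-break'],
};
console.log(candidateSelectors(title));
```

Not every candidate is equally useful: a bare tag such as h1 may match several elements on a page, while a selector that includes the id is usually the most specific.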
Several ways to write the same selector
CSS selectors are flexible in how they are written: you can point to the same element on a site in several ways and choose whichever best fits your needs.
For example, if you want to point the system to a header block with h1 tag and page-title class, you can do it in at least four ways: h1, h1.page-title, .page-title and h1[class="page-title"].
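The sketch below checks that all four notations address the same element. It is a deliberately simplified matcher written for this illustration, not how browsers or our application evaluate selectors: the function matchesSimple and the plain-object element format are assumptions, and only a tag name, #id, .class and [attr="value"] are supported.

```javascript
// Hypothetical, simplified matcher: supports only a tag name, #id, .class and
// [attr="value"] written together without spaces. Real browsers support far more.
function matchesSimple(el, selector) {
  // Sticky regex: each exec() must match exactly at token.lastIndex.
  const token = /([a-zA-Z][\w-]*)|#([\w-]+)|\.([\w-]+)|\[([\w-]+)="([^"]*)"\]/y;
  const attrs = el.attrs || {};
  const classes = (attrs.class || '').split(/\s+/).filter(Boolean);
  if (selector.length === 0) throw new Error('Empty selector');

  let pos = 0;
  while (pos < selector.length) {
    token.lastIndex = pos;
    const m = token.exec(selector);
    // A tag name is only allowed at the very start of the selector.
    if (!m || (m[1] !== undefined && pos !== 0)) {
      throw new Error('Unsupported selector: ' + selector);
    }
    const [, tag, id, cls, attrName, attrValue] = m;
    if (tag !== undefined && el.tag !== tag.toLowerCase()) return false;
    if (id !== undefined && attrs.id !== id) return false;
    if (cls !== undefined && !classes.includes(cls)) return false;
    if (attrName !== undefined && attrs[attrName] !== attrValue) return false;
    pos = token.lastIndex;
  }
  return true;
}

const header = { tag: 'h1', attrs: { class: 'page-title' } };
for (const s of ['h1', 'h1.page-title', '.page-title', 'h1[class="page-title"]']) {
  console.log(s, matchesSimple(header, s)); // each line ends with true
}
```

One difference is worth knowing: [class="page-title"] compares the whole attribute value, so it stops matching as soon as the element carries additional classes, while .page-title keeps matching.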
Yeah, it can be confusing at first 🤯. But in the end, you will see that CSS selectors are a convenient and practical tool.
In addition to this flexibility, you can also use cascading expressions 🤓. This means you can build a chain that first points to a parent element and then to an element nested inside it.
For example, you could write something like: .product-information .product-primary-data h1.page-title. This means: first find the block matching .product-information, then look inside it for a block matching .product-primary-data, and finally look inside that block for h1.page-title.
In many cases, a cascading selector is the only way to refer to an element. This is especially true for pages that contain several elements with the same classes or identifiers.
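A cascading selector is just simple selectors joined together, which the following sketch shows. The helper names descendant and child are hypothetical. In standard CSS, a space between two parts means "nested at any depth", while > means "direct child only".

```javascript
// Hypothetical helpers for combining simple selectors into a cascade.

// Descendant combinator (a space): the right part may be nested at any depth.
function descendant(...parts) {
  return parts.map((p) => p.trim()).filter(Boolean).join(' ');
}

// Child combinator (>): the right part must be a direct child of the left part.
function child(...parts) {
  return parts.map((p) => p.trim()).filter(Boolean).join(' > ');
}

console.log(descendant('#page', '#page_title')); // #page #page_title
console.log(child('a', 'div'));                  // a > div
```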
Examples of CSS selectors and how to describe them
The table below lists the HTML elements and selectors for these elements. Use this as an example.
|Basic examples|Possible CSS selectors|
|<h1 id="title">|#title, h1#title, h1[id="title"]|
|<a class="link">|.link, a.link, a[class="link"]|
|<a data-label="node">|[data-label="node"], a[data-label="node"]|
|Extended examples|Possible CSS selectors|
|<a><div></div></a>|a > div|
How do I check if I have the right CSS selector?
Google Chrome developer tools will help you do this. We described how to open this panel above. Only this time we will need the "Console" tool.
Open the console and type the command document.querySelectorAll('…'), replacing … with the selector you want to test, for example: document.querySelectorAll('#productTitle'). Press Enter, and the console will display the result of running this selector: a list of the matching elements, or an empty list if nothing matched.
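If you test selectors often, a small wrapper can make the result easier to read. The sketch below is an assumption-laden convenience, not a feature of Chrome or of our application: testSelector is a hypothetical name, and the function only relies on the standard querySelectorAll method of the root it is given.

```javascript
// Hypothetical helper: runs a selector against a root (normally `document`)
// and reports how many elements it matched, or why it failed.
function testSelector(selector, root) {
  try {
    const nodes = root.querySelectorAll(selector);
    return { ok: true, count: nodes.length };
  } catch (err) {
    // Browsers throw an error for selectors they cannot parse.
    return { ok: false, count: 0, error: err.message };
  }
}

// Only runs where a DOM is available, for example in the DevTools console.
if (typeof document !== 'undefined') {
  console.log(testSelector('#productTitle', document));
}
```

A count of 0 means the selector is valid but matches nothing on the current page, while ok: false means the selector itself is malformed.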
How do I specify selectors in the Data Excavator application?
In our application, CSS selectors are used to indicate the nodes to be extracted from the page. With this setting, our scraper extracts the nodes from the page and saves them.
To specify these settings, you need to open the project settings and go to the “Data to scrape” tab. In the lower part of this tab you will be able to specify all the elements to be extracted from the site pages.
You can find more information about our application settings in the FAQ section.