Understanding Crawling And How To Audit For Crawling Issues
Updated: Jul 13, 2022
We can't talk about crawling without first discussing how search engines work. The purpose of search engines like Google, Bing, and Yandex is to deliver search results that are relevant to a user's query. They do this by crawling lots of pages on the web with their bots (also called crawlers or spiders) and following the links within those pages to discover new ones. The pages are then processed and put in an index, to be returned in search results when they are relevant to a user's query.
WHAT IS CRAWLING?
To understand what crawling is and the processes involved, you need a good understanding of why a website is crawled, and how it is done.
The purpose of crawling is to discover new pages. In other words, crawling is the process where crawlers/bots/spiders discover new pages by following links within pages and accessing XML sitemaps. The new pages found are put in a crawl queue to be crawled later.
A website's pages and their contents cannot be crawled if they are not accessible.
Crawling is the process of accessing a page using bots, to discover new pages by following links and crawling their XML sitemaps.
For search engine bots, this means that they are constantly crawling the web to discover new content and updates to previously crawled pages.
How Do Search Engine Crawlers Crawl A Website?
I like to picture the process search engine bots follow by looking at how tools like Sitebulb crawl a website. I imagine the process is much the same, since these tools are, in fact, imitating how search engine crawlers crawl a site.
Sitebulb Crawl of a website
Here, you will see that:
The URL status was first checked to see if it was accessible.
Next, the URL's robots.txt was downloaded to check for directives: what to crawl & not crawl, and the location of the XML sitemap.
The URL's content was accessed, and new pages found were queued to be processed, rendered, and indexed.
This process aptly illustrates how search engines crawl a website by accessing the site’s robots.txt file to know where they are allowed to crawl and not crawl, and the location of the sitemap with a list of URLs on the site. The URLs are processed and indexed, with new pages found through links added to the crawl queue, and also undergoing processing and indexing.
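The steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how Sitebulb or Googlebot are actually implemented. The `example.com` pages, the `ROBOTS_TXT` contents, the `ExampleBot` user agent, and the `crawl` function are all made up for the example, and the "website" lives in a dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and pages (assumptions for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/private/admin">Admin</a>',
    "https://example.com/about": '<a href="/contact">Contact</a> <a href="/">Home</a>',
    "https://example.com/contact": "<p>No links here.</p>",
    "https://example.com/private/admin": "<p>Secret</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, user_agent="ExampleBot"):
    # Read robots.txt first: what may be crawled, and where the sitemap is.
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())

    queue = deque([start_url])  # the crawl queue
    seen = {start_url}
    crawled, blocked = [], []

    while queue:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            blocked.append(url)  # disallowed by robots.txt
            continue
        html = PAGES.get(url)
        if html is None:
            continue  # page is not accessible
        crawled.append(url)

        # Newly discovered links go to the back of the crawl queue.
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urldefrag(urljoin(url, href)).url
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return crawled, blocked, robots.site_maps()


crawled, blocked, sitemaps = crawl("https://example.com/")
print(crawled)
print(blocked)
print(sitemaps)
```

In this toy site, the `/private/admin` page is discovered through a link but never fetched, because robots.txt disallows it. A real crawler would also render pages, respect crawl rate limits, and read the URLs listed in the sitemap.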
What Is Crawled?
Crawlers don't just crawl web pages. They also crawl anything on the page that could be of value to a user, such as images, videos, text documents, maps, etc.
Just like the Sitebulb crawl illustration, all the content of a URL is crawled and processed including the text, images, videos, links, etc.
Want to see how Google is crawling your website? Check the Crawl Stats report in Google Search Console: log in to GSC > Settings > Crawl Stats > Open Report.
It contains information on the total number of crawl requests your website has received from Google in the last 90 days.
Why Google May Not Crawl Your Page
A page that is not accessible cannot be crawled, and neither can its contents. This could be a result of any of the following:
The page is blocked in robots.txt or through the .htaccess file.
It is not in your XML Sitemap.
It is an orphan page, i.e. no links point to the page so Google can't find it.
It has links pointing to it, but these are marked rel="nofollow".
The page has JavaScript issues, such as JS files blocked in robots.txt, that prevent Google from accessing the content of the page.
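The robots.txt causes in this list can be checked with Python's standard `urllib.robotparser` module. The sketch below is a rough illustration under made-up assumptions: the `ROBOTS_TXT` rules, the `example.com` URLs, and the `blocked_urls` helper are hypothetical. Note that Python's parser applies rules in file order and does not understand the `*` and `$` wildcards, so its verdict can differ from Google's for more complex files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a site section and the script directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /assets/js/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the subset of urls that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


urls = [
    "https://example.com/blog/post",
    "https://example.com/checkout/cart",
    "https://example.com/assets/js/app.js",
]
print(blocked_urls(ROBOTS_TXT, urls))
# ['https://example.com/checkout/cart', 'https://example.com/assets/js/app.js']
```

The blocked `app.js` file is the kind of JS issue mentioned above: the page itself may be crawlable, but a script it needs in order to render its content is not.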
How To Audit Your Website For Crawlability Issues
For your website to be indexed and show up in search results when a user searches for its content, it needs to be crawlable. To find out if a website is crawlable:
Audit the robots.txt file: Are important pages or resources blocked here? Validate the file using Google Search Console's robots.txt tester to see if there are any errors or warnings triggered.
Are there any issues in Google Search Console's coverage report hinting at pages not being crawled?
Run a crawl using tools like Screaming Frog to find pages blocked by robots.txt.
Check if any host issues are reported in your Crawl stats in Google Search Console.
Ensure your important indexable pages are included in the XML sitemap.
Make sure your pages are internally linked with no broken links and no rel="nofollow" (if you want the links to be followed), and that the links have an 'href' attribute.
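The last check in this list can be roughly automated for a single page. The sketch below uses Python's built-in `html.parser` to sort a page's links into followable links, nofollow links, and anchors with no 'href' attribute. The `LinkAudit` class and the `SAMPLE_HTML` snippet are made up for illustration. Checking for broken links would additionally require requesting each URL, which is left out here.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Sorts <a> tags into followable links, nofollow links, and anchors without href."""

    def __init__(self):
        super().__init__()
        self.followable = []
        self.nofollow = []
        self.missing_href = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            # Without an href, a crawler has no URL to follow.
            self.missing_href += 1
            return
        # rel can hold several space-separated values, e.g. "nofollow noopener".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollow.append(href)
        else:
            self.followable.append(href)


# Hypothetical page fragment (an assumption for illustration only).
SAMPLE_HTML = """
<a href="/services">Services</a>
<a href="/pricing" rel="nofollow noopener">Pricing</a>
<a onclick="go('/contact')">Contact</a>
"""

audit = LinkAudit()
audit.feed(SAMPLE_HTML)
print(audit.followable)    # ['/services']
print(audit.nofollow)      # ['/pricing']
print(audit.missing_href)  # 1
```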
PRACTICAL ILLUSTRATION OF HOW A WEBSITE CAN BE AUDITED FOR CRAWLABILITY ISSUES
I opened the robots.txt file of 'Exquisitetouche.com' to manually check whether any important pages or resources were disallowed from crawling. The website had the default WordPress robots.txt settings, and no important content was blocked there.
Also, I tested the homepage's URL in Google's robots.txt tester and there were no errors or warnings shown. Googlebot can crawl this page (allowed).
Google's robots.txt tester
I audited the website using Sitebulb, and there were no URLs disallowed from crawling.
Internal URLs with none disallowed
There were also no host issues reported in the Crawl Stats in Google Search Console.
No issues in Host Status
According to Google:
Host status describes whether or not Google encountered availability issues when trying to crawl your site.
All the pages that are supposed to be indexed are also included in the XML sitemap.
However, when I checked the 'Coverage' report in Google Search Console, there were lots of 'Discovered - currently not indexed' URLs in the Excluded tab. Of the roughly 63 URLs on the website, 24 have this issue.
A 'Discovered - currently not indexed' status means that Google has come across the URLs but has neither crawled nor indexed them yet.
Now, the question is WHY?
Further inspection of one of the URLs shows that it was discovered in the XML sitemap, but also that it has no referring pages.
This means that no pages link to it, and when Google cannot find other ways to reach a page, it is less likely to consider the page important enough to crawl.
However, it could also be that the content of the page is of poor quality and needs to be improved.
For this particular website, 24 uncrawled and unindexed pages is a lot for a site with just 63 pages (roughly 38% of the site). I would recommend adding internal links from popular, already indexed pages to the ones that have not been crawled. Google tends to re-crawl pages in its index to find updated content, and will likely follow any links from them to those pages.
To further improve these URLs' chances of being crawled and subsequently indexed, ensure that they do not have thin content, are relevant, and can satisfy users' intent.
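Pages that appear in the XML sitemap but have no referring pages, like the URL inspected above, can also be found with a short script. The following is a minimal sketch, not the method used in this audit. The `SITEMAP_XML` contents, the `INTERNAL_LINKS` map, and both helper functions are hypothetical. In practice, the link map would come from a crawl export from a tool such as Sitebulb or Screaming Frog.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap (an assumption for illustration only).
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/services</loc></url>
  <url><loc>https://example.com/blog/old-post</loc></url>
</urlset>
"""

# Hypothetical internal link map: source page -> pages it links to.
INTERNAL_LINKS = {
    "https://example.com/": ["https://example.com/services"],
    "https://example.com/services": ["https://example.com/"],
}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in an XML sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def pages_without_referrers(sitemap, links):
    """Return sitemap URLs that no other page links to."""
    linked_to = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # a page linking to itself does not count
    }
    return [url for url in sitemap if url not in linked_to]


orphans = pages_without_referrers(sitemap_urls(SITEMAP_XML), INTERNAL_LINKS)
print(orphans)  # ['https://example.com/blog/old-post']
```

Each URL in the result is a candidate for new internal links from pages that are already indexed.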