Scraping website content using HtmlAgilityPack

Build your own website crawler for extracting data from websites

HTML is a markup language similar to XML, but there are differences that make working with it a bit different. In theory HTML has a strict structure in terms of allowed elements and attributes, but in practice many documents online do not follow that structure exactly, since browsers usually manage to render them in the UI anyway.

However, this makes such documents harder to read programmatically if you want to pull data from the website itself, because you cannot apply all the XML rules when the structure is not strictly followed.

It might seem natural to use the standard .NET XML classes to parse HTML, and in some cases that works, but for dynamic websites, where content is updated daily or even more often and the chance of structural errors is high, this approach is not very reliable.

If an HTML document does not follow the XML rules, which is often the case, the XmlDocument or XDocument class will throw an exception and you cannot continue parsing the content. That makes it harder to scrape and filter online web content. Luckily there is an open source project called HtmlAgilityPack, hosted on CodePlex. It is also available as a NuGet package, so you can easily include it in your application.

HtmlAgilityPack is far more tolerant of poorly structured HTML, which makes it a good fit for building crawlers that scrape content from websites. Below you'll find a simple example of scraping content from TechCrunch.com.
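
To get a feeling for the difference, here is a small sketch you can run on its own. The broken HTML fragment is made up just for demonstration: XDocument gives up on the first unclosed tag, while HtmlAgilityPack still builds a document you can query.

using System;
using System.Xml.Linq;
using HtmlAgilityPack;

class ToleranceDemo
{
    static void Main()
    {
        // Deliberately broken markup: unclosed <li> tags and a missing </ul>
        string brokenHtml = "<ul><li>First item<li>Second item";

        try
        {
            // XDocument expects well-formed XML and gives up immediately
            XDocument.Parse(brokenHtml);
        }
        catch (System.Xml.XmlException ex)
        {
            Console.WriteLine("XDocument failed: " + ex.Message);
        }

        // HtmlAgilityPack builds a DOM anyway, so the nodes can still be queried
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(brokenHtml);
        Console.WriteLine("Items found: " + htmlDoc.DocumentNode.SelectNodes("//li").Count);
    }
}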

It is not enough to simply load the content into an HtmlAgilityPack object instance. There are a few things you need to think about before you can get the actual data you want from the page content.

Faking your request and introducing yourself to the website as a browser

You might run into problems, either right away or later on, if the website discovers that you are actually crawling it. To avoid this you need to modify your request so it looks the same as one coming from a browser.

There is a nice article with code snippets I wrote related to this, so feel free to use it. I use the same approach in the following example, so you can pick it up from there as well.
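
In short, the trick is to send the same headers a real browser would send. Here is just that part isolated in a small console sketch, using the same browser signature as in the full example below.

using System;
using System.Net;

class BrowserRequestDemo
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://techcrunch.com/apps/");
        request.Method = "GET";

        // Pretend to be Firefox on Windows so the response matches what a browser would get
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us,en;q=0.5");

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine((int)response.StatusCode + " " + response.StatusDescription);
        }
    }
}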

Acquiring XPath filters for elements

To obtain an element's XPath you can use the developer tools built into Google Chrome or an add-on called FirePath for Firefox. The XPath you get from these tools is not actually the best, since it does not include filtering by attributes, but you can modify it. It is still a lot better than writing the whole XPath from scratch.

For Google Chrome there is also a nice extension called XPath Helper which generates XPath expressions that include the attributes of the elements.
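
The difference matters because a purely positional XPath breaks as soon as the page layout shifts. Here is a small sketch of the idea (the class name river is made up just to illustrate the rewrite):

using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml("<div class='river'><ul><li><h2><a href='#'>Headline</a></h2></li></ul></div>");

        // Positional XPath, similar to what the browser tools generate: fragile
        var byPosition = htmlDoc.DocumentNode.SelectSingleNode("//div[1]/ul/li[1]/h2/a");

        // The same node selected by attribute: survives layout changes better
        var byAttribute = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='river']//li/h2/a");

        Console.WriteLine(byPosition.InnerText);   // Headline
        Console.WriteLine(byAttribute.InnerText);  // Headline
    }
}

With an XPath like this prepared for the target page, the full example against TechCrunch.com looks as follows.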

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.Net;

namespace WebsiteContentParser
{
    class Program
    {
        static void Main(string[] args)
        {
            // Prepare the HtmlAgilityPack document and let it fix badly nested tags
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.OptionFixNestedTags = true;

            string urlToLoad = "http://techcrunch.com/apps/";
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlToLoad);
            request.Method = "GET";

            /* Start browser signature */
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us,en;q=0.5");
            /* End browser signature */

            Console.WriteLine(request.RequestUri.AbsoluteUri);
            WebResponse response = request.GetResponse();

            // Load the response stream directly into the HtmlAgilityPack document
            htmlDoc.Load(response.GetResponseStream(), true);
            if (htmlDoc.DocumentNode != null)
            {
                // Every article teaser on the page is an <li> under the main content area
                var articleNodes = htmlDoc.DocumentNode.SelectNodes("/html/body/div[@role='main']/div[1]/div/div[1]/div/ul/li");

                if (articleNodes != null && articleNodes.Any())
                {
                    foreach (var articleNode in articleNodes)
                    {
                        // The headline is an anchor inside the article's <h2>
                        var titleNode = articleNode.SelectSingleNode("div/div/h2/a");
                        if (titleNode != null)
                        {
                            Console.WriteLine(WebUtility.HtmlDecode(titleNode.InnerText.Trim()));
                            Console.WriteLine("--------------------------------------------------------------------------");
                        }
                    }
                }
            }

            Console.ReadLine();
        }
    }
}
    

This will extract all news headlines from the page and simply list them in the console.

Crawler Output 

Note

Parsing HTML can also be done without any third-party library by using regular expressions, but you have to be really good with them, and they need to be updated every time the website structure changes.
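
For comparison, here is a rough sketch of the regular expression approach. The pattern below is only an assumption based on the h2/a markup used in the XPath above, and it shows exactly why this approach is so fragile.

using System;
using System.Net;
using System.Text.RegularExpressions;

class RegexDemo
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Same browser signature as in the HtmlAgilityPack example
            client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";
            string html = client.DownloadString("http://techcrunch.com/apps/");

            // Pull the text out of every <h2 ...><a ...>headline</a> pair.
            // The pattern is tied to the current markup and breaks as soon as the site changes its structure.
            foreach (Match match in Regex.Matches(html, @"<h2[^>]*>\s*<a[^>]*>(.*?)</a>", RegexOptions.Singleline))
            {
                Console.WriteLine(WebUtility.HtmlDecode(match.Groups[1].Value.Trim()));
            }
        }
    }
}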

Note that this solution needs to be updated whenever the website structure changes in order to keep parsing the HTML properly.


Disclaimer

The purpose of the code contained in snippets or available for download in this article is solely for learning and demo purposes. The author will not be held responsible for any failure or damages caused by any other usage.


About the author

DEJAN STOJANOVIC

Dejan is a passionate Software Architect/Developer. He is highly experienced in the .NET programming platform, including ASP.NET MVC and WebApi. He likes working on new technologies and exciting, challenging projects.
