Scraping website content using HtmlAgilityPack

Build your own website crawler for extracting data from websites

HTML is a markup language similar to XML, but there are differences that make working with it a bit different. In theory HTML has a strict structure in terms of allowed elements and attributes, but in practice many documents online do not follow that structure exactly, since browsers usually manage to render them in the UI anyway.

However, this makes such documents harder to read programmatically if you want to pull data from the website itself, because you cannot apply all the XML rules when the structure is not strictly followed.

It might seem natural to use the standard .NET XML classes to parse HTML, and in some cases that works, but for dynamic websites, where content is updated daily or even more often and the chance of structural errors is high, this approach is not very reliable.

If an HTML document does not follow the XML rules, which is often the case, the XmlDocument or XDocument class will throw an exception and you cannot continue parsing the content. That makes it harder to scrape and filter online web content. Luckily there is an open source project called HtmlAgilityPack, hosted on CodePlex. It is also available as a NuGet package, so you can easily include it in your application.

HtmlAgilityPack is far more tolerant of poorly structured HTML, which makes it a good fit for building crawlers that scrape content from websites. Below you'll find a simple example of scraping content from TechCrunch.com.
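
To get a feeling for the difference, here is a small sketch you can run on its own. The broken HTML fragment is made up just for demonstration: XDocument gives up on the first unclosed tag, while HtmlAgilityPack still builds a document you can query.

using System;
using System.Xml.Linq;
using HtmlAgilityPack;

class ToleranceDemo
{
    static void Main()
    {
        // Deliberately broken markup: unclosed <li> tags and a missing </ul>
        string brokenHtml = "<ul><li>First item<li>Second item";

        try
        {
            // XDocument expects well-formed XML and gives up immediately
            XDocument.Parse(brokenHtml);
        }
        catch (System.Xml.XmlException ex)
        {
            Console.WriteLine("XDocument failed: " + ex.Message);
        }

        // HtmlAgilityPack builds a DOM anyway, so the nodes can still be queried
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(brokenHtml);
        Console.WriteLine("Items found: " + htmlDoc.DocumentNode.SelectNodes("//li").Count);
    }
}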

It is not enough to simply load the content into an HtmlAgilityPack object instance. There are a few things you need to think about before you can get the actual data you want from the page content.

Faking your request and introducing yourself to the website as a browser

You might run into problems, either right away or later on, if the website discovers that you are actually crawling it. To avoid this you need to modify your request so it looks the same as one coming from a browser.

There is a nice article with code snippets I wrote related to this, so feel free to use it. I use the same approach in the following example, so you can pick it up from there as well.
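
In short, the trick is to send the same headers a real browser would send. Here is just that part isolated in a small console sketch, using the same browser signature as in the full example below.

using System;
using System.Net;

class BrowserRequestDemo
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://techcrunch.com/apps/");
        request.Method = "GET";

        // Pretend to be Firefox on Windows so the response matches what a browser would get
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us,en;q=0.5");

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine((int)response.StatusCode + " " + response.StatusDescription);
        }
    }
}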

Acquiring XPath filters for elements

To obtain an element's XPath you can use the developer tools built into Google Chrome or an add-on called FirePath for Firefox. The XPath you get from these tools is not actually the best, since it does not include filtering by attributes, but you can modify it. It is still a lot better than writing the whole XPath from scratch.

For Google Chrome there is also a nice extension called XPath Helper which generates XPath expressions that include the attributes of the elements.
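
The difference matters because a purely positional XPath breaks as soon as the page layout shifts. Here is a small sketch of the idea (the class name river is made up just to illustrate the rewrite):

using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml("<div class='river'><ul><li><h2><a href='#'>Headline</a></h2></li></ul></div>");

        // Positional XPath, similar to what the browser tools generate: fragile
        var byPosition = htmlDoc.DocumentNode.SelectSingleNode("//div[1]/ul/li[1]/h2/a");

        // The same node selected by attribute: survives layout changes better
        var byAttribute = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='river']//li/h2/a");

        Console.WriteLine(byPosition.InnerText);   // Headline
        Console.WriteLine(byAttribute.InnerText);  // Headline
    }
}

With an XPath like this prepared for the target page, the full example against TechCrunch.com looks as follows.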

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.Net;

namespace WebsiteContentParser
{
    class Program
    {
        static void Main(string[] args)
        {
            // Prepare the HtmlAgilityPack document and let it fix badly nested tags
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.OptionFixNestedTags = true;

            string urlToLoad = "http://techcrunch.com/apps/";
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlToLoad);
            request.Method = "GET";

            /* Start browser signature */
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us,en;q=0.5");
            /* End browser signature */

            Console.WriteLine(request.RequestUri.AbsoluteUri);
            WebResponse response = request.GetResponse();

            // Load the response stream directly into the HtmlAgilityPack document
            htmlDoc.Load(response.GetResponseStream(), true);
            if (htmlDoc.DocumentNode != null)
            {
                // Every article teaser on the page is an <li> under the main content area
                var articleNodes = htmlDoc.DocumentNode.SelectNodes("/html/body/div[@role='main']/div[1]/div/div[1]/div/ul/li");

                if (articleNodes != null && articleNodes.Any())
                {
                    foreach (var articleNode in articleNodes)
                    {
                        // The headline is an anchor inside the article's <h2>
                        var titleNode = articleNode.SelectSingleNode("div/div/h2/a");
                        if (titleNode != null)
                        {
                            Console.WriteLine(WebUtility.HtmlDecode(titleNode.InnerText.Trim()));
                            Console.WriteLine("--------------------------------------------------------------------------");
                        }
                    }
                }
            }

            Console.ReadLine();
        }
    }
}
    

This will extract all news headlines from the page and simply list them in the console.

Crawler Output 

Note

Parsing HTML can also be done without any third-party library by using regular expressions, but you have to be really good with them, and they need to be updated every time the website structure changes.
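
For comparison, here is a rough sketch of the regular expression approach. The pattern below is only an assumption based on the h2/a markup used in the XPath above, and it shows exactly why this approach is so fragile.

using System;
using System.Net;
using System.Text.RegularExpressions;

class RegexDemo
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Same browser signature as in the HtmlAgilityPack example
            client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";
            string html = client.DownloadString("http://techcrunch.com/apps/");

            // Pull the text out of every <h2 ...><a ...>headline</a> pair.
            // The pattern is tied to the current markup and breaks as soon as the site changes its structure.
            foreach (Match match in Regex.Matches(html, @"<h2[^>]*>\s*<a[^>]*>(.*?)</a>", RegexOptions.Singleline))
            {
                Console.WriteLine(WebUtility.HtmlDecode(match.Groups[1].Value.Trim()));
            }
        }
    }
}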

Note that this solution needs to be updated whenever the website structure changes in order to keep parsing the HTML properly.


Disclaimer

The purpose of the code contained in snippets or available for download in this article is solely for learning and demo purposes. The author will not be held responsible for any failure or damages caused by any other usage.


About the author

DEJAN STOJANOVIC

Dejan is a passionate Software Architect/Developer. He is highly experienced in the .NET programming platform, including ASP.NET MVC and WebApi. He likes working on new technologies and exciting, challenging projects.
