Sunday, 6 December 2009

Using WatiN to parse and scrape HTML

Of course you can scrape web pages with WatiN, so why this blog post, you might ask. Well, with the out-of-the-box WatiN browser support you need to instantiate a browser to get the HTML for a page and scrape it. If you don’t need to interact with the page (clicking links, typing text, or relying on cookies for session state), you don’t need the overhead (memory, startup time) of a browser.

This post shows how to use WatiN without a browser by implementing a new browser class, MsHtmlBrowser, using techniques described in this article.

For this code I use the current development code, which will soon be released as WatiN 2.0 RC. You can download the code for this post here.

Load a URL with HTMLDocumentClass

The trick to make this work is to make use of IPersistStreamInit and HTMLDocumentClass.

First, let’s get the definition of IPersistStreamInit in place:
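
A minimal sketch of such a definition, assuming the standard COM declaration of IPersistStreamInit from the Windows SDK (the interface ID and the vtable order of the methods are taken from there). Only InitNew is needed for the technique in this post; the other members are declared so that the vtable layout stays correct.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

// Managed declaration of the COM interface IPersistStreamInit.
// The member order must match the native vtable order.
[ComImport]
[Guid("7FD52380-4E07-101B-AE2D-08002B2EC713")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IPersistStreamInit
{
    // Inherited from IPersist.
    void GetClassID(out Guid pClassID);

    // Returns S_OK (0) or S_FALSE (1), so keep the raw HRESULT.
    [PreserveSig]
    int IsDirty();

    void Load(IStream pStm);

    void Save(IStream pStm, [MarshalAs(UnmanagedType.Bool)] bool fClearDirty);

    void GetSizeMax(out long pcbSize);

    // Puts the object into its initial, empty state.
    void InitNew();
}
```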


And this is how we can combine it with the HTMLDocumentClass (don’t forget to add a reference to the assembly Microsoft.mshtml.dll distributed with WatiN):
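
The combination might look roughly like the sketch below. It assumes the IPersistStreamInit interface from this post is available in the same project, that the code runs on an STA thread, and that pumping messages with Application.DoEvents (which needs a reference to System.Windows.Forms) is acceptable while the page loads. The HtmlLoader class name and the "null" options string are assumptions for illustration, not part of WatiN.

```csharp
using System.Windows.Forms;
using mshtml;

// Hypothetical helper: loads a URL into an MSHTML document without a browser window.
public static class HtmlLoader
{
    public static IHTMLDocument2 Load(string url)
    {
        // Create an empty document and initialise it through IPersistStreamInit.
        var seed = new HTMLDocumentClass();
        ((IPersistStreamInit)seed).InitNew();

        // Ask MSHTML to create a new document for the given URL.
        // The "null" options value is an assumption; adjust it if your setup needs other options.
        IHTMLDocument2 loaded = ((IHTMLDocument4)seed).createDocumentFromUrl(url, "null");

        // MSHTML loads asynchronously and needs a message pump to make progress.
        while (loaded.readyState != "complete")
        {
            Application.DoEvents();
        }

        return loaded;
    }
}
```

Note that this loop has no timeout; real code would probably want one so that a page which never finishes loading cannot hang the caller.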


Using multi-browser support in WatiN 2.0

Knowing this, we now need to give this code a place in the WatiN architecture. With the introduction of multi-browser support in WatiN 2.0, the architecture of the WatiN API has been changed so that new browser implementations can be added without changing the WatiN.Core code. To create a new Browser implementation we need to provide concrete implementations of the INative* interfaces for the browser we are adding. Since we are basing our new browser implementation on the same mshtml DLL that Internet Explorer uses, we can reuse a lot of the IE-specific native classes already available in the WatiN.Core.Native.InternetExplorer namespace.

But since we don’t want to use WatiN.Core.IE and need to use WatiN.Core.Browser to tap into the WatiN architecture, we need to create our own browser class. Let’s call it MsHtmlBrowser (inheriting from the abstract class Browser). This forces us to implement two abstract methods (WaitForComplete and Close) and one abstract property (NativeBrowser). Let’s focus on implementing the NativeBrowser property first.

Implementing MsHtmlNativeBrowser

NativeBrowser returns a type implementing INativeBrowser, so let’s create a class MsHtmlNativeBrowser which implements INativeBrowser. This requires several methods and properties to be implemented. For many of these we can’t provide an implementation, since we aren’t wrapping a real browser. The two we need to focus on are the NavigateTo(url) method and the NativeDocument property.

As you might guess, we can add our code (to load a page in an HTMLDocumentClass instance) to the NavigateTo method. After initialization of the object we wrap it in an IEDocument, which is returned by the NativeDocument property. The implementation follows. All the other methods throw a NotImplementedException.
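
A rough sketch of such a class is shown here. The member list of INativeBrowser and the IEDocument constructor taking an IHTMLDocument2 are assumptions based on the WatiN 2.0 development code and may differ in your version, so check them against the WatiN source. The IPersistStreamInit interface from this post is assumed to be defined in the same project.

```csharp
using System;
using System.Windows.Forms;
using mshtml;
using WatiN.Core.Native;
using WatiN.Core.Native.InternetExplorer;

public class MsHtmlNativeBrowser : INativeBrowser
{
    private IEDocument _document;

    // Loads the page into an MSHTML document and wraps it for WatiN.
    public void NavigateTo(Uri url)
    {
        var seed = new HTMLDocumentClass();
        ((IPersistStreamInit)seed).InitNew();

        // The "null" options value is an assumption.
        IHTMLDocument2 loaded =
            ((IHTMLDocument4)seed).createDocumentFromUrl(url.AbsoluteUri, "null");

        // Pump messages so MSHTML can finish loading (requires an STA thread).
        while (loaded.readyState != "complete")
        {
            Application.DoEvents();
        }

        // Assumed constructor: IEDocument(IHTMLDocument2).
        _document = new IEDocument(loaded);
    }

    // Null until NavigateTo has been called.
    public INativeDocument NativeDocument
    {
        get { return _document; }
    }

    // No real browser is wrapped, so the remaining members are not supported.
    public void NavigateToNoWait(Uri url) { throw new NotImplementedException(); }
    public bool GoBack() { throw new NotImplementedException(); }
    public bool GoForward() { throw new NotImplementedException(); }
    public void Reopen() { throw new NotImplementedException(); }
    public void Refresh() { throw new NotImplementedException(); }
    public IntPtr hWnd { get { throw new NotImplementedException(); } }
}
```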


Back to implementing MsHtmlBrowser.

MsHtmlBrowser continued

Now that we have MsHtmlNativeBrowser, we can return an instance of this class in the NativeBrowser property of MsHtmlBrowser.

The implementation of the Close method is empty since we don’t have a real browser we need to close. You might consider disposing the instance of MsHtmlNativeBrowser here.

For the WaitForComplete method we can reuse functionality in the IEWaitForComplete class by passing in the IEDocument instance of the MsHtmlNativeBrowser.

This results in the following implementation:
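
A sketch of what that could look like, under the assumption that Browser.WaitForComplete takes a timeout in seconds and that IEWaitForComplete has a constructor accepting an IEDocument plus that timeout, followed by a DoWait call. Both are taken from memory of the WatiN 2.0 development code and should be verified against the version you build against.

```csharp
using WatiN.Core;
using WatiN.Core.Native;
using WatiN.Core.Native.InternetExplorer;

public class MsHtmlBrowser : Browser
{
    private readonly MsHtmlNativeBrowser _nativeBrowser = new MsHtmlNativeBrowser();

    // Hands WatiN the browserless native implementation.
    public override INativeBrowser NativeBrowser
    {
        get { return _nativeBrowser; }
    }

    // Reuses the IE wait logic on the wrapped MSHTML document.
    // Assumed signature: IEWaitForComplete(IEDocument, int).
    public override void WaitForComplete(int waitForCompleteTimeOut)
    {
        var document = (IEDocument)_nativeBrowser.NativeDocument;
        new IEWaitForComplete(document, waitForCompleteTimeOut).DoWait();
    }

    // There is no real browser window to close.
    public override void Close()
    {
    }
}
```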



Using the new MsHtmlBrowser
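
As an illustration, scraping all link URLs from a page could look like the sketch below. The ScrapeExample class and the URL are placeholders chosen for this example; GoTo, Links and Url are the regular WatiN members you would also use with the IE class. The STAThread attribute is there because MSHTML needs a single-threaded apartment.

```csharp
using System;
using WatiN.Core;

public static class ScrapeExample
{
    [STAThread]
    public static void Main()
    {
        var browser = new MsHtmlBrowser();
        try
        {
            // Placeholder URL; replace it with the page you want to scrape.
            browser.GoTo("http://watin.sourceforge.net");

            // Use the normal WatiN element collections on the loaded document.
            foreach (Link link in browser.Links)
            {
                Console.WriteLine(link.Url);
            }
        }
        finally
        {
            // Close is a no-op for MsHtmlBrowser, but keeps the calling pattern the same as for IE.
            browser.Close();
        }
    }
}
```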


This concludes the example of creating a browserless Browser implementation.

Enjoy testing with WatiN!


2 comments:

Jeff Brown said

I wonder whether it's still possible to render a screenshot of the document on demand, say by printing to a memory DC.

Incidentally, the screenshot support needs to be refactored and moved into individual native browsers at some point...

tak said

Can you also describe how to save the current page in Watin? The issue is on: