It Takes a Whole Life to be Arrix

November 20, 2010

Server side readability with node.js

For people just looking for code: node-readability on github

Readability by Arc90 is a fantastic javascript tool that makes web pages easier and more enjoyable to read. It removes the clutter around the article content and applies legible and beautiful styles. Apple has incorporated it into Safari Reader. Many other apps have integrated readability, too.

While it's fast and pleasant on a modern desktop browser, the performance on resource limited devices is still unsatisfactory. It often takes several seconds to process a page on my iPad. On Nexus One, the time is much longer.

Several efforts have been made to port the readability algorithm to the server side.

All these ports have deviated from the original implementation to adapt to the target technology, which means they may produce different results, and extra work must be done to keep them up to date whenever a newer version of readability comes out.

It would be nice if we could run readability.js in a server side javascript host environment so that

The result is as close to that in browsers as possible
Minimal changes are required

So I took a stab at adapting readability.js to the server side using node.js and jsdom. The code is available on github and there is a live demo (coming soon). The result is quite good in my testing, except that it's a bit slow.

Here is an example of usage:

var fs = require('fs'),
    readability = require('./lib/readability.js');

var html = fs.readFileSync('test/nytime.html', 'utf-8');

// This is a very early example. The API is subject to change.
readability.parse(html, 'http://www.example.com/somepage.html', function(result) {
    console.log(result.title, result.content);
});

Porting readability.js to node.js

There isn't a full browser environment available for node.js.
Features that cannot easily be made to work are disabled for now, e.g. fetching next pages and iframe support.

Another reason for disabling the two features is to keep the code synchronous. node.js is a single threaded, event driven execution environment, and there is nothing like locking. But readability is written as a singleton, one shot object, so I have to reset its state before every run. If the code stopped halfway to wait for IO, it might be re-entered before the current run finishes, which must be avoided.

Some code paths are disabled because they don't make sense in a non-browser environment and may cause problems.

Some NodeList iteration loops are slightly modified to work around a jsdom limitation where a live NodeList isn't updated automatically after DOM changes when accessed via indexing.
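
One way to avoid depending on live NodeList behaviour is to copy the list into a plain array before touching the DOM. The sketch below only illustrates that idea and is not the actual change made in node-readability; `toArray` is a hypothetical helper, and a plain array-like object stands in for a NodeList.

```javascript
// Hypothetical helper: copy an array-like (such as a NodeList) into a real
// array, so later DOM changes cannot shift the indexes during iteration.
function toArray(list) {
    var result = [], i, len;
    for (i = 0, len = list.length; i < len; i++) {
        result.push(list[i]);
    }
    return result;
}

// A plain array-like object stands in for a NodeList here.
var fakeList = { 0: 'a', 1: 'b', 2: 'c', length: 3 };
var snapshot = toArray(fakeList);
```

The copy costs one extra pass, but the loop that follows no longer cares whether the underlying list is kept up to date.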

Readability looks for commas (,) when calculating scores. I've extracted the hard coded literals into a variable so that it can be configured to match punctuation in multiple languages.

var reComma = /[\uff0c,]/; //chinese comma, too
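
As a rough illustration of how such a pattern can be used, the sketch below counts comma separated segments in a piece of text. `countSegments` is a hypothetical name, and treating the segment count as the score contribution is an assumption about the scoring step, not a quote from readability.js.

```javascript
// Matches the ASCII comma and the full-width Chinese comma (U+FF0C).
var reComma = /[\uff0c,]/;

// Hypothetical helper: number of comma-separated segments in the text.
function countSegments(text) {
    return text.split(reComma).length;
}

var english = countSegments('one, two, three');
var chinese = countSegments('\u4e00\uff0c\u4e8c\uff0c\u4e09');
```

Both calls return 3, so Chinese text is scored the same way as English text with the same number of commas.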

Readability UI elements, header and footer aren't included in the result. This is merely done to allow more flexible usage. I'd like to include acknowledgement to Arc90 in any final product and suggest all of you do the same.

Most of the time was spent on performance optimization. See below.

Performance

The first working version was incredibly slow. It commonly took 5-10 seconds to process a moderately sized page. Certain pages could take minutes, as if the process had frozen. While node.js uses the very fast V8 javascript engine, the DOM implemented in jsdom is written in pure javascript and isn't optimized for performance yet.

I added simple profiling code so that I could see how much time each step takes and find the code paths most worth optimizing. Below is a sample output for http://en.wikipedia.org/wiki/Ruby

19 Nov 20:57:32 - ---DOM created
19 Nov 20:57:32 -     0 seconds [Remove all stylesheets] 
19 Nov 20:57:32 -     0 seconds [Turn all double br's into p's] 
19 Nov 20:57:32 -   0.05 seconds [prepDocument] 
19 Nov 20:57:33 -     0.455 seconds [grabArticle nodePrepping] 
19 Nov 20:57:33 -     0.015 seconds [grabArticle calculate scores] 
19 Nov 20:57:33 -     0.227 seconds [grabArticle find top candidate] 
19 Nov 20:57:33 -     0.033 seconds [grabArticle look through its siblings] 
19 Nov 20:57:34 -       0.043 seconds [cleanConditionally] 
19 Nov 20:57:34 -       0.2 seconds [cleanConditionally] 
19 Nov 20:57:34 -       0.032 seconds [cleanConditionally] 
19 Nov 20:57:34 -       0.054 seconds [cleanConditionally] 
19 Nov 20:57:34 -       0.026 seconds [prepArticle Remove extra paragraphs] 
19 Nov 20:57:34 -       0.206 seconds [prepArticle innerHTML replacement] 
19 Nov 20:57:34 -     1.372 seconds [prepArticle] 
19 Nov 20:57:34 -   2.407 seconds [grabArticle] 
19 Nov 20:57:34 - 2.53 seconds [================= TOTAL] 
Profiling summary ==========================
    1   2.530 ================= TOTAL
    1   0.050 prepDocument
    1   0.000 Remove all stylesheets
    1   0.000 Turn all double br's into p's
    1   2.407 grabArticle
    1   0.455 grabArticle nodePrepping
    1   0.015 grabArticle calculate scores
 2338   0.071 getInnerText
    1   0.227 grabArticle find top candidate
  105   0.259 getLinkDensity
    1   0.033 grabArticle look through its siblings
    1   1.372 prepArticle
    4   0.329 cleanConditionally
    1   0.026 prepArticle Remove extra paragraphs
    1   0.206 prepArticle innerHTML replacement
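
The profiling code itself isn't shown here. A minimal sketch of how per-step call counts and timings like the ones above could be collected is given below; `timed`, `summary` and the `profile` object are hypothetical names, not the ones used in node-readability.

```javascript
// Hypothetical profiler: accumulates call count and elapsed time per label.
var profile = {};

function timed(label, fn) {
    var start = Date.now();
    try {
        return fn();
    } finally {
        // Record the call even if fn throws.
        var entry = profile[label] || (profile[label] = { calls: 0, ms: 0 });
        entry.calls += 1;
        entry.ms += Date.now() - start;
    }
}

// One line per label: call count, total seconds, label.
function summary() {
    return Object.keys(profile).map(function (label) {
        var e = profile[label];
        return e.calls + ' ' + (e.ms / 1000).toFixed(3) + ' ' + label;
    });
}

var value = timed('getInnerText', function () { return 'some text'; });
timed('getInnerText', function () { return 'more text'; });
```

Wrapping a step in `timed()` leaves its return value untouched, so the instrumentation can be added and removed without changing the surrounding code.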

As shown in the summary, getInnerText is called many times. That's one function that I made a hundredfold faster, cutting the running time by seconds.
element.textContent is rather slow in jsdom, so I replaced it with a tree walker.

function TextWalker(node, func) {
    function walk(cur) {
        var children, len, i;
        if (cur.nodeType == 3) {
            func(cur);
            return;
        } else if (cur.nodeType != 1) {
            return;
        }
        
        children = cur.childNodes;
        for (i = 0, len = children.length; i < len; i++) {
            walk(children[i]);
        }
    }
    walk(node);
}

var textContent = '';
TextWalker(e, function(cur) {
    textContent += cur.nodeValue;
});

Tree walkers like the one above are also used in other places to speed up NodeList iteration. As we know, the getElementsByTagName() function family returns a live NodeList which is updated automatically when the DOM changes. Live NodeLists are very fast in most browsers because of highly efficient caching. That's why getElementsByTagName() is much faster than querySelectorAll().

But in jsdom, things are quite the opposite. Keeping live NodeLists up to date is very expensive in jsdom ~~because there is no caching at all. In a tight loop that modifies the DOM, live NodeLists are just unaffordable.~~ Update: There is simple version number based caching now, but a tree walker is still much faster.

So a carefully crafted tree walker is used to replace the live NodeList in the "node prepping" part of grabArticle(). This optimization is significant, reducing the running time for certain pages from several minutes to seconds.
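
The walker used in grabArticle() isn't reproduced in this post. A minimal sketch of the general idea, visiting element nodes by recursion instead of through a live NodeList, might look like the following. `walkElements` is a hypothetical name, plain objects stand in for DOM nodes, and this read-only version does not handle callbacks that remove or replace nodes during the walk.

```javascript
// Hypothetical element walker: visits every element node (nodeType 1) under
// root in document order, without relying on a live NodeList.
function walkElements(root, visit) {
    var children = root.childNodes, i, len, child;
    for (i = 0, len = children.length; i < len; i++) {
        child = children[i];
        if (child.nodeType === 1) {
            visit(child);
            walkElements(child, visit);
        }
    }
}

// Plain objects stand in for DOM nodes.
function el(tag, kids) { return { nodeType: 1, tagName: tag, childNodes: kids || [] }; }
function text(value) { return { nodeType: 3, nodeValue: value, childNodes: [] }; }

var tree = el('DIV', [el('P', [text('hi'), el('A')]), text('x'), el('SPAN')]);
var seen = [];
walkElements(tree, function (node) { seen.push(node.tagName); });
```

After the walk, `seen` holds P, A and SPAN in that order: the same elements a getElementsByTagName('*') call on the DIV would list, but collected in one recursive pass.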

So far so fast - 1.1 seconds per page

These optimizations turned out to be very effective. In my testing of 140 pages with an average size of 58KB collected from digg, delicious and hacker news, the average time taken for each page is about 1.1 seconds on a Mac Mini (2.4G Intel Core 2 Duo).

The task is CPU bound. The running time is often not linear in DOM size, because DOM content and structure can greatly impact performance. The slowest case is when readability fails to extract enough content and reruns the algorithm with more aggressive configurations. I believe that with more specific tuning for jsdom, the running time can be further reduced.

Limitations

While the port gives good results for most pages, Node.js + jsdom isn't a full browser environment, so the missing features impose limitations.

The server side DOM environment doesn't understand CSS. It is reasonable for readability to make use of CSS information. For example, it could discard invisible or very small elements. Although readability currently uses hardly any CSS information, this could become an important limitation in the future.

The next step

Coming from the world of servers and back end systems, I care about performance most. I'm going to study jsdom more closely to understand its performance characteristics better. For now, some options I have in mind include:

Replace live NodeList iteration with DOM tree traversal when applicable.
Combine DOM traversals. Do several things in one go.
Avoid large innerHTML assignment when possible. HTML parsing and DOM creation are expensive.
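
To make the idea of combining traversals concrete, here is a small sketch in which several independent jobs share one walk over the tree. `walkOnce` and the two visitor functions are hypothetical, and plain objects again stand in for DOM nodes.

```javascript
// Hypothetical single-pass walker: every visitor sees every node once, so
// several jobs share one traversal instead of walking the tree repeatedly.
function walkOnce(root, visitors) {
    var i, len, children = root.childNodes || [];
    for (i = 0; i < visitors.length; i++) {
        visitors[i](root);
    }
    for (i = 0, len = children.length; i < len; i++) {
        walkOnce(children[i], visitors);
    }
}

// Plain objects stand in for DOM nodes.
var doc = {
    nodeType: 1, tagName: 'DIV', childNodes: [
        { nodeType: 1, tagName: 'P', childNodes: [{ nodeType: 3, nodeValue: 'Hello', childNodes: [] }] },
        { nodeType: 1, tagName: 'P', childNodes: [{ nodeType: 3, nodeValue: ' world', childNodes: [] }] }
    ]
};

var paragraphs = 0, textLength = 0;
walkOnce(doc, [
    function (node) { if (node.nodeType === 1 && node.tagName === 'P') paragraphs++; },
    function (node) { if (node.nodeType === 3) textLength += node.nodeValue.length; }
]);
```

Here the paragraph count (2) and the total text length (11) come out of the same pass; with a slow DOM, each traversal saved this way is a direct win.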

Regarding readability.js itself, I'd like to suggest a few improvements. I'd be happy to contribute.

Separate the library part from the bookmarklet part. The core algorithms can be extracted as a library. This allows the core function to be used with different front ends, e.g. a browser extension or a widget.
Organize the core into several modules and break big functions into smaller ones.
Add some hooks and more configurations so that it's possible to do page specific optimization in a plug-in manner.
Currently, readability is a singleton, one shot object. It would be nice to make it classy so that state management becomes easier in an event driven architecture.
Be unobtrusive. It would be nice if the original DOM can be left intact.

At the end of this post, I'd like to thank Arc90 for building such a wonderful tool and Elijah Insua (@tmpvar), whose jsdom has opened many seriously cool possibilities for server side javascript.

April 21, 2010

Augmented Reality on iPhone: Marker Tracking Demo

I've been working on an augmented reality game app for iPhone recently. The player will be able to interact with the game by placing and moving specially designed marker cards in front of the iPhone's camera. The app analyzes video frames in real time and updates the scene accordingly. Here is a demo of what I've got so far: marker tracking on iPhone 3GS.

Marker tracking is done by a slightly modified version of NyARtoolkitCPP. Analyzing each frame takes < 55ms on average. The OpenGL ES rendering can achieve 30 frames per second. The current algorithm relies on thresholding, so it is very sensitive to changing lighting conditions. Also, template matching based identification requires matching against the whole pattern library for each detected marker. I'm looking for better algorithms that use edge detection and id recognition. StbTracker, the successor of ARToolkitPlus, looks ideal for devices such as the iPhone, but unfortunately it's not open source.

The most discussed topic by iPhone AR application developers today is real time video frame grabbing. There isn't an elegant way yet. For now, I'm using private APIs to repeatedly fetch the camera preview view. The good news is that we are going to have full access to still and video camera data in iPhone OS 4. A peek at the related APIs in the iPhone SDK 4 beta 2 further confirms that.

The OpenGL ES overlay content is displayed in a second UIWindow above the one containing the camera preview. The only point of doing so is to get a clean preview image without the overlaying content. These ugly hacks will all be gone once we have iPhone OS 4.

February 21, 2009

Bypassing code signing to test your app on the iPhone device

Apple requires apps to be signed with a signing identity before you run them on a device or distribute them on the App Store. To get a valid signing identity (code signing certificate + private key), you must apply for Apple's iPhone developer program and pay them $99.

It is possible to run and test your app on your jailbroken iPhone without the need to pay (you'll still have to pay if you wish to publish your app on the App Store anyway). This process is often referred to as bypassing code signing.

Open Cydia on your iPhone. On the home page, scroll down and tap Bypassing Code Signature in the Developer section to see instructions for bypassing code signing. There are 3 options. Option #1 is what I tried. You can find detailed instructions for the other options on the internet.

Option #1: The verification check in the iPhone OS has been hacked so you can sign your code with any self-signing certificate and iPhone will let you pass. If you want the most integrated and smooth development experience with Xcode (Build and Go and debug on the device), this is the option for you.
You need to do two things.
The first is to bypass the Xcode verification.

Create a self-signed certificate named iPhone Developer following Apple's instructions in Obtaining a Signing Identity.
In Xcode -> Project -> Edit Project Settings -> Add User-Defined Setting
PROVISIONING_PROFILE_ALLOWED : NO
PROVISIONING_PROFILE_REQUIRED : NO
(there seems to be a way in which you don't have to add these settings on a per project basis but I haven't tried. see Developing Application for iPhone OS)
Add to Info.plist -> SignerIdentity : Apple iPhone OS Application Signing
(from http://forums.macrumors.com/showthread.php?t=640341 )

Now you should be able to choose device targets and build your app. Try clean if you see code signing related errors.
You'll probably get a device verification error if your iPhone's Mobile Installation files haven't been patched yet. So the second thing to do is to compromise the verification on the iPhone.

add http://cydia.hackulo.us to your Cydia sources
install miPatch in Cydia
reboot your iPhone (I didn't reboot and it worked IIRC)

Done.

see Developing Application for iPhone OS for more information.

Option #2: Just build your app, scp it to your iPhone and run ldid -S yourapp. ldid (Link Identity Editor) is a command line package available in Cydia. See http://www.ipodtouchfans.com/forums/showthread.php?t=104884

Now I can debug on my device live - tap a button and stop at a breakpoint set in Xcode. I wouldn't want to write any iPhone app without a real Mac!

Legal Notice: I don't live in a country where this kind of notice is favoured. But for completeness, please know that bypassing Apple's restrictions is usually illegal. However, $99 is too much for a curious guy who just wants to test his Hello World.

February 17, 2009

Private methods in Objective-C

Objective-C doesn't have the concept of private methods, although you can use @private, @public and @protected to specify the visibility of a class's instance variables.
One way to simulate private methods is to use a category.


// in MyClass.m, before the main @implementation block
@interface MyClass (Private)
- (void)privateMethod;
@end

// you won't get warnings if this block is missing
@implementation MyClass (Private)
- (void)privateMethod {
  //do sth...
}
@end

Note that it's a good practice to write an @implementation block for the category as soon as you finish the category @interface declaration. If the @implementation block for the category is missing, the compiler won't warn you about any missing method implementations.


// in MyClass.m, before the main @implementation block
@interface MyClass ()
- (void)privateMethod;
@end

@implementation MyClass
- (void)privateMethod {
  //do sth...
}
@end

You can also leave the category name blank to use an extension. If you do so, you must implement the methods in the main @implementation block. The compiler will always issue warnings when implementations for the methods are missing.

January 27, 2009

marquee generates swarms of scroll events in Webkit

The non-standard marquee tag, first introduced in early versions of IE, can be useful to create scrolling effects. Though deprecated, all major browsers today seem to support it.

sample scrolling text in marquee

One thing you should be aware of is that in Webkit (used by Apple Safari and Google Chrome), marquee generates a steady stream of scroll events. If you have listeners for the scroll event on the document, watch out for these events, which bubble up to the document! You may want to filter them out by checking event.target. For example:


$(document).bind('scroll', function(event) {
 //filter out scroll events of marquees in Webkit
 if (event.target != document) return;
 //document scroll...
});

August 27, 2007

Screen scraping with jQuery

A test case in my work requires a complete list of HTML elements and a list of self-closing elements (e.g. <br/>).

The W3C Index of HTML 4 Elements lists all defined elements in a table. For each row with an "E" in the Empty column, the corresponding element doesn't need a closing tag (and thus is self-closing).

With two lines of jQuery code in the Firebug console, I got the lists I wanted. Here is how:

To get all elements

$.map($('table tr:gt(0) a'), function(e) {return $.trim($(e).text());})

To get all self-closing elements (formatted for readability)

$.map(
  $('table tr:gt(0)').filter(function() {
    return $(this).find('td:nth-child(4)').text() == 'E';
  }), 
  function(e){return $.trim($(e).find('td:first-child').text());});

$.trim() is needed because the HTML source contains \n in the Name column.

This demonstrates a handy usage of jQuery as a hacking tool. Another excellent demonstration can be found here.

You can add jQuery to the current page using the jQuerify bookmarklet.

Happy jQuerifying!

July 26, 2007

Ruby code: Finding the closest point

We want to fetch an avatar of a specific size. How do we find the closest size among all available sizes? Suppose the available sizes are 16, 48 and 96.

An obvious version is

  def closest_size(size)
    case
    when size < 32
      16
    when size < 72
      48
    else
      96
    end
  end

The value used in comparison tests is the average of two neighboring candidate sizes. The problem is that when our available sizes change, we need to add or remove when clauses and recalculate average values.

Here is a smarter way

def closest_size(size)
    points = [16, 48, 96]
    distances = points.map {|a| (size - a).abs}
    points[distances.index(distances.min)]
end

It finds the point with the shortest distance to the given size. Now if we want to change the candidate sizes, we only need to change the array literal. Further, we can even pass the candidate array as an argument.