Using node.js to parse and split HTML files

Parsing and manipulating text files is, it seems, a job as old as computers themselves.

Every now and then, every developer ends up with a file, or a bunch of files, that needs some manipulation. For one-off jobs a bit of sed and awk might do, but the syntax is not exactly friendly, particularly if you're working with HTML.

The problem

I had a single large HTML file containing a lot of divs, each with an id attribute that is unique within the file. I wanted to split this file into one file per div and name each file after the id of its div.

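<div id="section1"><div>Sub div</div> ....more stuff</div>
<div id="section2">...some more html...</div>

This would become:

section1.html

<div id="section1"><div>Sub div</div> ....more stuff</div>

section2.html

<div id="section2">...some more html...</div>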

The tool I chose for this little task was node.js, which is a way of running JavaScript on the server rather than in a browser. I chose node because it lets me use a language I'm using very frequently at the moment, and because it feels like quite a natural way to work with HTML, as that is what we do day-to-day on the web.

Cheerio

Cheerio is a small and fast implementation of the jQuery API for working with HTML. This means I can use the familiar jQuery API for traversing and manipulating HTML within node.

The steps for the script are:

  • Load in the original HTML file
  • Use cheerio to find the elements I want to split out
  • For each element, get the id
  • Write a file named after the element id containing the HTML for that element

var cheerio = require('cheerio'),
    fs = require('fs');

fs.readFile('complex.html', 'utf8', dataLoaded);

function dataLoaded(err, data) {
    // Wrap the file contents in a single parent so the top-level divs
    // can be selected as its direct children.
    var $ = cheerio.load('<div id="topLevelWrapper">' + data + '</div>');
    $('#topLevelWrapper > div').each(function(i, elem) {
        var id = $(elem).attr('id'),
            filename = id + '.html',
            content = $.html(elem); // outer HTML of this div
        fs.writeFile(filename, content, function(err) {
            console.log('Written html to ' + filename);
        });
    });
}

Note that some error handling has deliberately been left out to keep this example simple.

And that's it: fewer than 20 lines of pretty easy-to-understand JavaScript.
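If you do want the error handling that was left out of the script, the file-writing step could be sketched roughly like this. It is only one way to do it and rests on a few assumptions of mine: the helper name `writeSections`, the `outDir` parameter, the `{ id, html }` object shape, and the decision to skip elements with no id are all hypothetical. It uses the synchronous `fs.writeFileSync` to keep the flow easy to follow, which is usually fine for a one-off script.

```javascript
var fs = require('fs'),
    path = require('path');

// Hypothetical helper: writes one file per section and reports failures
// instead of ignoring them. `sections` is an array of { id, html } objects,
// for example collected inside the cheerio .each() callback.
function writeSections(sections, outDir) {
    var written = [],
        skipped = [];

    sections.forEach(function (section) {
        // An element without an id would otherwise end up in "undefined.html".
        if (!section.id) {
            skipped.push(section);
            return;
        }

        var filename = path.join(outDir, section.id + '.html');
        try {
            fs.writeFileSync(filename, section.html, 'utf8');
            written.push(filename);
        } catch (err) {
            // Report the failure and carry on with the remaining sections.
            console.error('Could not write ' + filename + ': ' + err.message);
            skipped.push(section);
        }
    });

    return { written: written, skipped: skipped };
}
```

Inside the `.each()` callback you would push `{ id: id, html: content }` onto an array, then call `writeSections(thatArray, '.')` once the loop has finished and check the `skipped` list it returns.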
