Using node.js to parse and split HTML files

Parsing and manipulating text files is a job that seems as old as computers themselves.

Every now and then, every developer ends up with a file, or a bunch of files, that needs some manipulation. For one-off jobs, a bit of sed and awk might do, but the syntax is not exactly friendly, particularly if you're working with HTML.

The problem

I have a single large HTML file with a lot of divs. Each div has an id attribute that is unique within the file. I wanted to split this file into one file per div and name that file after the id of the div.

<div id="section1"><div>Sub div</div> ....more stuff</div>
<div id="section2">...some more html...</div>

This would become:

section1.html

<div id="section1"><div>Sub div</div> ....more stuff</div>

section2.html

<div id="section2">...some more html...</div>

The tool I chose for this little task was node.js, a way of running JavaScript on the server rather than in a browser. I chose node because it lets me use a language I'm using very frequently at the moment, and because it feels like a natural way to work with HTML, as that is what we do day-to-day on the web.

Cheerio

Cheerio is a small and fast implementation of the jQuery API for working with HTML. This means I can use the familiar jQuery API for traversing and manipulating HTML within node.

  • Load in original HTML file
  • Use cheerio to find the elements I want to split out
  • For each element, get the id
  • Write a file named after the element id containing the HTML for that element
var cheerio = require('cheerio'),
    fs = require('fs');

fs.readFile('complex.html', 'utf8', dataLoaded);

function dataLoaded(err, data) {
    // Wrap the file contents so the top-level divs can be selected as direct children
    var $ = cheerio.load('<div id="topLevelWrapper">' + data + '</div>');
    $('#topLevelWrapper > div').each(function(i, elem) {
        var id = $(elem).attr('id'),
            filename = id + '.html',
            content = $.html(elem); // outer HTML of this div, including its own tag
        fs.writeFile(filename, content, function(err) {
            console.log('Written html to ' + filename);
        });
    });
}

Note that some error handling has deliberately been left out to keep this example simple.
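
For anyone who wants the error handling back, here is a rough sketch of one way it might look. The names safeFilename and writeSections are hypothetical helpers of my own, not part of cheerio or node. The sketch assumes you first collect { id, html } pairs inside the each() loop, then hand them over to be written. It skips any div without a usable id (otherwise you end up with a file called undefined.html) and reports the first write error through a node-style callback.

```javascript
var fs = require('fs'),
    path = require('path');

// Hypothetical helper: turn an element id into a file name.
// Returns null when the id is missing, blank, or contains a path separator.
function safeFilename(id) {
    if (typeof id !== 'string' || id.trim() === '') {
        return null;
    }
    if (/[\/\\]/.test(id)) {
        return null;
    }
    return id + '.html';
}

// Hypothetical helper: write each { id, html } pair to outDir/<id>.html.
// Calls done(firstError, writtenFiles) once, after every write has finished.
function writeSections(sections, outDir, done) {
    var jobs = sections.filter(function (section) {
            return safeFilename(section.id) !== null;
        }),
        pending = jobs.length,
        written = [],
        firstError = null;

    if (pending === 0) {
        return done(null, written);
    }

    jobs.forEach(function (section) {
        var filename = path.join(outDir, safeFilename(section.id));
        fs.writeFile(filename, section.html, function (err) {
            if (err) {
                firstError = firstError || err; // remember only the first failure
            } else {
                written.push(filename);
            }
            pending -= 1;
            if (pending === 0) {
                done(firstError, written);
            }
        });
    });
}
```

In dataLoaded you would also check err before touching data, push { id: id, html: content } into an array inside the loop, and call writeSections with that array and an output directory once the loop has finished.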

And that's it: fewer than 20 lines of pretty easy-to-understand JavaScript.