Node JS Parse Gzip Compressed Web Page Example

Many web pages are compressed while they travel between the web server and the client browser, which reduces the page size and improves transfer speed. When the browser receives a compressed page, it decompresses it and then renders it.

But when you use Node.js to crawl such a web page, you may get garbled content, because the data you receive is still compressed and must be decompressed before you can read it. This article shows how to do that.
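
To see why the raw body looks garbled, here is a minimal local sketch that needs no network access. It compresses a small made-up HTML string (the `html` value is only an illustration, not real page content) and shows that the bytes are unreadable until they are decompressed.

```javascript
// Minimal local demonstration; no network access is needed.
const zlib = require('zlib');

// Hypothetical sample page content, used only for illustration.
const html = '<html><body>Hello, compressed world!</body></html>';

// Compress the text the way a server might before sending it.
const compressed = zlib.gzipSync(Buffer.from(html, 'utf8'));

// Every gzip stream starts with the two magic bytes 0x1f 0x8b.
const looksGzipped = compressed[0] === 0x1f && compressed[1] === 0x8b;
console.log('Starts with gzip magic bytes:', looksGzipped);

// Interpreting the compressed bytes as text does not give the page back.
const messy = compressed.toString('utf8');
console.log('Readable without decompression:', messy === html);

// Decompressing restores the original text.
const restored = zlib.gunzipSync(compressed).toString('utf8');
console.log('Restored after gunzip:', restored === html);
```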

1. How To Check A Compressed Web Page.

You can use the Google Chrome inspector to check whether a web page is compressed or not.

  1. Open Chrome and browse to https://www.yahoo.com as an example.
  2. Right-click the web page and click the Inspect menu item in the popup menu.
  3. Click the Network tab in the right panel, then click the domain name www.yahoo.com in the left panel. The Response Headers area in the right panel shows the header content-encoding: gzip, which means this web page is compressed using gzip.
    Figure: checking whether a web page is compressed with the Google Chrome inspector.
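
The same check can be sketched in Node.js by reading the response headers. The helper names `getContentEncoding` and `checkCompression` below are hypothetical and chosen only for this illustration. The sketch assumes the server honors the Accept-Encoding request header.

```javascript
const https = require('https');

// Hypothetical helper: returns the compression named in the response
// headers, or 'none' when the Content-Encoding header is absent.
// Node.js stores incoming header names in lower case.
function getContentEncoding(headers) {
    const value = headers['content-encoding'];
    return value ? String(value).trim().toLowerCase() : 'none';
}

// Hypothetical helper: requests a URL and reports its encoding.
// Nothing is requested until this function is called.
function checkCompression(url, callback) {
    const options = { headers: { 'Accept-Encoding': 'gzip' } };
    https.get(url, options, function (resp) {
        // Discard the body; only the headers matter here.
        resp.resume();
        callback(null, getContentEncoding(resp.headers));
    }).on('error', callback);
}
```

Calling checkCompression('https://www.yahoo.com', function (err, encoding) { console.log(err || encoding); }) should then print the encoding the server chose, or none if the response was not compressed.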

2. Decompress A Web Page Using The Node JS Built-in zlib Module.

zlib is a built-in Node.js module. It can decompress common compression formats such as gzip and deflate. The example below shows how to use it to decompress a gzip-compressed web page.

parse_zip_webpage_use_zlib.js

// Import https module.
var https = require("https");

// Import zlib module.
var zlib = require('zlib');

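var requestOptions = {
    protocol:'https:',
    hostname:'www.yahoo.com',
    port:443,
    method:'GET',
    // Tell the server that this client accepts gzip compressed responses.
    headers:{'Accept-Encoding':'gzip'}
};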

// Create a http.ClientRequest object
var request = https.request(requestOptions, function (resp) {

    // This array collects all the data chunks returned by the server.
    var htmlData = [];

    // When the server returns a chunk of data.
    resp.on('data', function (data) {
        // Push the chunk to the array.
        htmlData.push(data);
    })

    // When the server has returned all the data.
    resp.on('end', function () {

        // Concatenate all chunks into one buffer holding the response body.
        var buffer = Buffer.concat(htmlData);

        // If the server did not compress the response, print it as is.
        if (resp.headers['content-encoding'] !== 'gzip') {
            console.log(buffer.toString());
            return;
        }

        // Use zlib to decompress the web page content.
        zlib.gunzip(buffer, function(err, decoded) {
            if (err) {
                console.error('Failed to decompress: ' + err.message);
                return;
            }

            // Print the decompressed data to the console.
            console.log(decoded.toString());
        });
    })

});

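// Report network errors such as a failed DNS lookup.
request.on('error', function (err) {
    console.error('Request failed: ' + err.message);
});

// Finish sending the request so that the server can process it.
request.end();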
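
Servers do not always answer with gzip, so a crawler may want to pick the decompression function from the content-encoding header. The following is a minimal sketch of that idea using the synchronous zlib functions. The helper name `decodeBody` and the sample `page` content are hypothetical, and Brotli support assumes a Node.js version whose zlib module includes it.

```javascript
const zlib = require('zlib');

// Hypothetical helper: choose a zlib function based on the value of the
// Content-Encoding response header. Unknown or missing values are treated
// as uncompressed data and returned unchanged.
function decodeBody(buffer, encoding) {
    switch ((encoding || '').toLowerCase()) {
        case 'gzip':
            return zlib.gunzipSync(buffer);
        case 'deflate':
            // Assumes zlib-wrapped deflate data.
            return zlib.inflateSync(buffer);
        case 'br':
            return zlib.brotliDecompressSync(buffer);
        default:
            return buffer;
    }
}

// Local check with made-up content, so no network access is needed.
const page = Buffer.from('<html><body>Hello</body></html>');
console.log(decodeBody(zlib.gzipSync(page), 'gzip').toString());
console.log(decodeBody(zlib.deflateSync(page), 'deflate').toString());
console.log(decodeBody(page, undefined).toString());
```

In the listing above, a call such as decodeBody(buffer, resp.headers['content-encoding']) could take the place of the gunzip call, at the cost of blocking while large bodies are decompressed.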

3. Decompress A Web Page Using The request Module.

request is a third-party Node.js module, so you need to install it before using it. Note that request has been marked as deprecated on npm since 2020, although it can still be installed.

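  1. Open a terminal.
  2. Execute the command npm install request -g.
  3. After some time, the request folder is added in /usr/local/lib/node_modules (the exact path depends on your Node.js installation). Now you can use it. If require('request') cannot find the module, run npm install request inside the project folder instead.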

parse_zip_webpage_use_request.js

// Import the request module. It is a third-party module, so install it with npm first.
var request = require('request');

// Create the request options.
var reqOptions = {
    url: 'https://www.yahoo.com',
    gzip: true
};

// Request the web page with the specified options.
request(reqOptions, function(err, resp, body) {

    // Stop if the request failed; resp and body are not set in that case.
    if (err) {
        console.error('Request failed: ' + err.message);
        return;
    }

    // Get all response header names. The keys of resp.headers are lower case.
    var headerNameArray = Object.keys(resp.headers);

    var headerArraySize = headerNameArray.length;
    // Loop over the response header names.
    for(var i=0;i<headerArraySize;i++)
    {
        var headerName = headerNameArray[i];

        // Get header value for each header name.
        var headerValue = resp.headers[headerName];

        // Print header name and value in the console.
        console.log(headerName + " = " + headerValue);
    }

    // Print out the web page body, which request has already decompressed.
    console.log(body);
});
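
In Node.js, resp.rawHeaders is a flat array that alternates header names and values in their original casing, while resp.headers is an object keyed by lower-case names. If the original casing is wanted when printing, the pairs could be read as in the sketch below. The helper name `pairRawHeaders` and the `sample` array are hypothetical and only mimic the shape of resp.rawHeaders.

```javascript
// Hypothetical helper: convert a rawHeaders array, which alternates
// header names and values, into a list of [name, value] pairs.
function pairRawHeaders(rawHeaders) {
    const pairs = [];
    for (let i = 0; i + 1 < rawHeaders.length; i += 2) {
        pairs.push([rawHeaders[i], rawHeaders[i + 1]]);
    }
    return pairs;
}

// Made-up sample in the same shape as resp.rawHeaders.
const sample = ['Content-Type', 'text/html', 'Content-Encoding', 'gzip'];
pairRawHeaders(sample).forEach(function (pair) {
    console.log(pair[0] + ' = ' + pair[1]);
});
```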