
Use compressed data directly – from ZIP files or gzip http response


Do you use compression when working with resources in your scripts and projects? Many data files are compressed to save disk space, and compressed requests over the network save bandwidth.

This article gives some hints on using the “batteries included” power of Python to handle compressed files or to use HTTP with gzip compression:

  1. reading log files, whether gzip-compressed or uncompressed
  2. using files directly from a ZIP or RAR archive
  3. using gzip compression while fetching web pages

Reading gzip-compressed log files

The most common use case for reading compressed files is log files. My access_log files are archived in a separate directory and can easily be used for creating statistics. This is normally done with zgrep, zless, and other command-line tools. But opening a gzip-compressed file is built into Python and can be made transparent to the file handling.

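import gzip, sys

# sys.argv[0] is the script name, so a file argument means at least 2 entries
if len(sys.argv) < 2:
    sys.exit("usage: %s LOGFILE" % sys.argv[0])
try:
    # gzip.open() also succeeds for plain text; the error shows up on the first read
    fp = gzip.open(sys.argv[1])
    line = fp.readline()
except IOError:
    # not a gzip file: fall back to a normal open of the same path
    fp = open(sys.argv[1])
    line = fp.readline()

while line:
    ...
    # fetch the next line, otherwise the loop never ends
    line = fp.readline()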

The tricky part is the IOError handler. If you open a plain text file with the GzipFile class, it fails while reading the first line, not while calling the constructor. That is why the first readline() call has to sit inside the try block.

If you have bzip2-compressed files, you can use the bz2 module with its BZ2File constructor in the same way.
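
The fallback above relies on catching the read error. A different technique, sketched here for Python 3 (the listings in this article are Python 2), is to look at the first bytes of the file before choosing an opener: gzip data starts with the bytes 1f 8b and bzip2 data starts with "BZh". The function name open_maybe_compressed is made up for this illustration.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream
BZIP2_MAGIC = b"BZh"       # first three bytes of every bzip2 stream


def open_maybe_compressed(path):
    """Return a binary file object for path, decompressing if needed.

    Hypothetical helper: it picks the opener from the file's leading
    bytes instead of waiting for a read error.
    """
    with open(path, "rb") as probe:
        head = probe.read(3)
    if head.startswith(GZIP_MAGIC):
        return gzip.open(path, "rb")
    if head.startswith(BZIP2_MAGIC):
        return bz2.open(path, "rb")
    return open(path, "rb")
```

The returned object can be iterated line by line like any other binary file, so the statistics code does not need to know how the log was stored.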

Reading a file directly from zip archive

If you need a British word list, your Linux system can help (if the British dictionary is installed). If not, you can get one from several sources. I chose the zip-compressed file from pyxidium.co.uk and included it in a small Python script.

ZIP files are a container format. You can put many files into one archive, so you have to choose which file you want to decompress from it. In the example I fetch the ZIP archive into memory and unzip the file en-GB-wlist.txt directly.

import zipfile, os, StringIO, urllib

def openWordFile():
    # reading the English word dictionary
    pathToBritishWords = "/usr/share/dict/british-english"
    uriToBritishWords = "http://en-gb.pyxidium.co.uk/dictionary/en-GB-wlist.zip"

    if os.path.isfile(pathToBritishWords):
        # use the locally installed dictionary if there is one
        fp = open(pathToBritishWords)
    else:
        # fetch the whole archive from the URI into memory
        data = urllib.urlopen(uriToBritishWords).read()
        # get a ZipFile object based on the fetched data
        zf = zipfile.ZipFile(StringIO.StringIO(data))
        # read one file directly from the ZipFile object
        fp = zf.open("en-GB-wlist.txt")
    return fp

#read all lines
words = openWordFile().readlines()

print "read %d words" % len(words)
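
The in-memory step can be tried without a network connection. The following is a minimal Python 3 sketch (the listing above is Python 2) that uses io.BytesIO in place of StringIO. The helper name read_member_lines, the member name words.txt, and the three sample words are made up for the illustration; the archive is built in memory only so that the example is self-contained.

```python
import io
import zipfile


def read_member_lines(zip_bytes, member_name):
    """Return the text lines of one member of a ZIP archive held in memory."""
    # ZipFile needs a seekable file object, so wrap the raw bytes in BytesIO
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member_name) as fp:
            return fp.read().decode("utf-8").splitlines()


# Build a tiny archive in memory; this stands in for the downloaded data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("words.txt", "colour\nflavour\nharbour\n")

words = read_member_lines(buffer.getvalue(), "words.txt")
print("read %d words" % len(words))
```

If you do not know the member name in advance, zf.namelist() returns the names of all files stored in the archive.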

If you want to read directly from a RAR file you have to install the rarfile module.

pip install rarfile

The module uses the command-line utility rar/unrar, but it is used in the same way as the zipfile module.

import rarfile
rf = rarfile.RarFile("test.rar")
fp = rf.open("compressed.txt")

Use gzip compression while fetching web pages

The speed of fetching web pages depends on many parameters. To save bandwidth, one of the important ones, you should fetch HTTP resources compressed. Every modern browser supports this feature (see the test on browserscope.org) and every web server should be able to compress text content.

Your HTTP client must send the HTTP header “Accept-Encoding” to offer the server the possibility of sending compressed content. You then have to check the response headers to see whether the server actually sent compressed content. A web server can ignore this request header!

import urllib2, zlib, gzip, StringIO, sys

uri = "http://web.de/index.html"
req = urllib2.Request(uri, headers={"Accept-Encoding":"gzip, deflate"})
res = urllib2.urlopen(req)
if res.getcode()==200:
    # the header is missing (None) if the server ignored our Accept-Encoding
    encoding = res.headers.getheader("Content-Encoding") or ""
    if "gzip" in encoding:
        # the urllib2 file object doesn't support tell/seek, so repack it in StringIO
        fp = gzip.GzipFile(fileobj = StringIO.StringIO(res.read()))
        data = fp.read()
    elif "deflate" in encoding:
        data = zlib.decompress(res.read())
    else:
        # uncompressed response
        data = res.read()
else:
    print "Error while fetching ...: %s" % res.msg
    sys.exit(-1)

print "read %s bytes (compression: %s), decompressed to %d bytes" % (
    res.headers.getheader("Content-Length"),
    res.headers.getheader("Content-Encoding"),
    len(data))
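
For readers on Python 3, where urllib2 became urllib.request, the same idea can be sketched as follows. This is an illustration under assumptions, not a drop-in library: the function names decode_body and fetch_decoded are invented here, and the fallback for raw deflate streams is an addition that the listing above does not have (some servers send "deflate" data without the zlib wrapper).

```python
import gzip
import urllib.request
import zlib


def decode_body(body, content_encoding):
    """Decompress a response body according to its Content-Encoding value."""
    encoding = (content_encoding or "").lower()
    if "gzip" in encoding:
        return gzip.decompress(body)
    if "deflate" in encoding:
        try:
            # the standard form: deflate data inside a zlib wrapper
            return zlib.decompress(body)
        except zlib.error:
            # assumption: fall back to a raw deflate stream without wrapper
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # header missing or unknown: the server ignored Accept-Encoding
    return body


def fetch_decoded(uri):
    """Fetch uri while offering gzip/deflate, and return the decoded bytes."""
    req = urllib.request.Request(uri, headers={"Accept-Encoding": "gzip, deflate"})
    with urllib.request.urlopen(req) as res:
        return decode_body(res.read(), res.headers.get("Content-Encoding"))


# Offline check of the decoding step with locally compressed data.
payload = b"<html>hello</html>" * 50
packed = gzip.compress(payload)
print("%d bytes on the wire, %d bytes decoded" % (len(packed), len(decode_body(packed, "gzip"))))
```

Keeping the decoding in its own function means it can be checked without a network connection, and fetch_decoded stays a thin wrapper around the request.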

As a developer I did not find automatic support for gzip-enabled HTTP requests in the HTTP clients of various libraries, and Python's urllib2 does not offer it built-in either. Copy and paste these lines into your next project, or convert them to your favorite language, and your HTTP request layer will become faster.

Conclusion

One disadvantage: your software will consume a few percent more CPU to decompress the data on the fly, so it will be slower when everything happens on your local machine. But Python uses a C binding to zlib, which is as fast as any other component, and in a network environment you can measure the benefit.
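
To get a feeling for this trade-off on your own machine, you can time the decompression step in isolation. The following Python 3 sketch uses an invented access-log line repeated many times as sample data. Such repetitive input compresses far better than real logs, so treat the numbers as a rough indication only.

```python
import gzip
import time

# Invented sample data: one access-log style line, repeated 20000 times.
sample = b'127.0.0.1 - - [01/Jan/2010:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234\n' * 20000
compressed = gzip.compress(sample)

# Time only the decompression, which is the extra CPU cost for the client.
start = time.perf_counter()
restored = gzip.decompress(compressed)
elapsed = time.perf_counter() - start

ratio = len(compressed) / len(sample)
print("compressed to %.1f%% of the original size" % (ratio * 100))
print("decompression took %.4f seconds" % elapsed)
```

Comparing the decompression time with the transfer time saved on your connection shows whether compression pays off in your setup.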

The post Use compressed data directly – from ZIP files or gzip http response appeared first on united-coders.com.

