by Arun Kumar
Data collection using Python
Web crawling is a very useful technique for obtaining data from websites. Python offers various packages that make the task quick and easy, and NLTK provides functions to clean and evaluate HTML files.
In this lesson we will learn:
• What is web crawling?
• What are the libraries for web crawling in Python?
• How to use Python for downloading a web page?
• How to clean and extract data from HTML files using NLTK?
In simple terms, web crawling is the process of gathering pages from the Web in order to index them and support a search engine. The major objective of this process is to gather web pages from the Internet. When we crawl a website, we have to take care of some crawling issues: honor the robots.txt file (a file that describes what is and is not allowed to be crawled), tell the web server who you are, and avoid overloading the server. Now we can start trying to download some HTML files using Python.
Crawling restrictions are declared in robots.txt, which we can read using the Python module robotparser:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
The robotparser module provides the functions set_url, read, and can_fetch:
>>> rp.set_url("http://neuro.compute.dtu.dk/robots.txt")
# set_url points the parser at the site's robots.txt file
>>> rp.read()
# This downloads and parses the robots.txt
>>> rp.can_fetch("*", "http://neuro.compute.dtu.dk/wiki/Special:Search")
False
# If this function returns False, crawling is not allowed on the page
>>> rp.can_fetch("*", "http://neuro.compute.dtu.dk/movies/")
True
# If this function returns True, crawling is allowed on the page
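Putting the etiquette rules together, here is a minimal sketch that checks robots.txt before fetching and pauses between requests (the urllib2 download call is covered in the next section, and the 2-second pause is an arbitrary choice):
import time
import urllib2
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://neuro.compute.dtu.dk/robots.txt")
rp.read()

for url in ["http://neuro.compute.dtu.dk/movies/"]:
    if rp.can_fetch("*", url):
        # Download only the pages robots.txt allows
        html = urllib2.urlopen(url).read()
        # Pause between requests to avoid overloading the server
        time.sleep(2)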
Using urllib2
We can use urllib2 to download HTML files. It is a library for retrieving URLs and HTML pages from the Internet, downloading web pages via the HTTP protocol, and it also supports other network protocols such as HTTPS and FTP. The module ships with Python 2.x; in Python 3 the same functionality lives in urllib.request. The examples here use the HTTP protocol.
import urllib2
# The module is imported
f = urllib2.urlopen(URL)
# urlopen is the function provided by urllib2 to open HTML pages;
# URL here is a placeholder for the address to fetch
f = urllib2.urlopen('http://docs.python.org/2/library/urllib2.html')
# This opens the HTML page and assigns a file-like object to f
print f.read()
# This displays the content of the HTML page
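As noted above, it is good practice to tell the web server who you are. A minimal sketch using urllib2.Request to set a User-Agent header (the agent string and contact address are just examples):
import urllib2

request = urllib2.Request('http://docs.python.org/2/library/urllib2.html',
                          headers={'User-Agent': 'MyCrawler/1.0 (contact@example.com)'})
f = urllib2.urlopen(request)
print f.read()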
Some web servers require authentication before they serve content. To provide basic authentication, urllib2 supports a function commonly known as a handler.
urllib2.HTTPBasicAuthHandler()
# This handler can be used to provide basic authentication to the
# server.
import urllib2
# The module is imported
auth_handler = urllib2.HTTPBasicAuthHandler()
# Calling the handler
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/siteupdates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
# Here realm is the authentication realm, uri is the URL,
# user is the user name, and passwd is the password
opener = urllib2.build_opener(auth_handler)
# Building an opener
urllib2.install_opener(opener)
# Here we set the password and user name globally; when the library
# tries to open a particular web page it automatically provides the
# user name and password assigned to the handler
urllib2.urlopen('http://www.example.com/login.html')
# This will retrieve the particular HTML page.
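If the credentials are rejected, the server answers with HTTP status 401 and urllib2 raises an HTTPError; a minimal sketch of catching it:
import urllib2

try:
    f = urllib2.urlopen('http://www.example.com/login.html')
    print f.read()
except urllib2.HTTPError, e:
    # e.code holds the HTTP status, e.g. 401 for failed authentication
    print 'Download failed with status', e.code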
Once we receive the HTML file, we have to clean it to get at the content. Cleaning an HTML file means removing tags and other HTML-specific markup from the file.
For example:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
In this HTML code we are interested in the content only, which is “My First Heading” and “My first paragraph.” So it is important to remove HTML tags such as <h1>, <p>, etc. NLTK (in version 2.x) provides a function to do this task. To use the function, we first have to import nltk. The function has the common form
nltk.clean_html(html_string)
# Here we pass the HTML source as a string argument
import nltk
html_doc = """
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""
print nltk.clean_html(html_doc)
Here we call nltk.clean_html(html_doc) to remove the HTML tags. We can also use nltk.clean_html() together with the urllib2 library.
import urllib2
# The module is imported
import nltk
# The module is imported
f = urllib2.urlopen('http://docs.python.org/2/library/urllib2.html')
print nltk.clean_html(f.read())
# This code displays the content of the HTML page without the HTML
# tags
It is also possible to use nltk.clean_html() with files:
import io
# Package for Unicode file support
import nltk
with io.open('Example.HTML', 'r', encoding='utf8') as infh:
    raw_content = nltk.clean_html(infh.read())
# We open an HTML file named Example.HTML using the io.open() method
# and pass the file's content to the nltk.clean_html() function;
# raw_content contains the content of the HTML page
with io.open('content.txt', 'w', encoding='utf8') as f:
    f.write(raw_content)
# We write the content of raw_content to a text file called content.txt
It is also possible to read and clean HTML files in another way:
import nltk
from urllib2 import urlopen
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# Here we pass the url to urlopen and read the response
print nltk.clean_html(html)
# This prints the content of the HTML file.
The problem with nltk.clean_html() is that it fails to clean complex HTML files. When our HTML files have complicated structures, we have to use other tools such as BeautifulSoup.
For documentation
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
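A minimal sketch of extracting text with BeautifulSoup (assuming the bs4 package is installed, e.g. via pip install beautifulsoup4):
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://docs.python.org/2/library/urllib2.html').read()
soup = BeautifulSoup(html, 'html.parser')
# get_text() strips all tags and returns only the page text
print soup.get_text()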
Downloading files with wget and Python
To download an entire website we can use wget together with Python.
wget is a GNU program for downloading web pages on Linux platforms. wget supports recursive downloading, so we can use it to download large websites. The common form of wget is
wget http://www.website.com/file.zip
# Here it downloads the zip file from the given website
wget -r http://www.website.com/index.html
# -r recursively downloads all files linked to the page
wget -r -U Mozilla http://www.website.com/index.html
# In case a web server does not allow download managers,
# -U can be used to tell the web server you are using a common web
# browser
wget --wait=15 -r -U Mozilla http://www.website.com/index.html
# Some web servers may blacklist an IP if they notice that all the
# pages are being downloaded quickly; the --wait=15 option waits 15
# seconds between requests to prevent this
Now let us execute wget from Python:
import os
cmd = 'wget -r http://www.uoc.edu'
os.system(cmd)
# This will download UOC's website to the current working directory;
# after downloading we can access the HTML pages and process them
# from the local directory.
Or
import os
cmd = 'wget --wait=15 -r -U Mozilla http://www.uoc.edu'
os.system(cmd)
# This will download UOC's website to the current working directory,
# requesting each page at a 15-second interval.
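As an alternative sketch, the subprocess module lets us pass the wget arguments as a list, which avoids shell-quoting problems (assuming wget is installed on the system):
import subprocess

# Each argument is a separate list element; no shell parsing is involved
subprocess.call(['wget', '--wait=15', '-r', '-U', 'Mozilla',
                 'http://www.uoc.edu'])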