Thursday, June 11, 2015

Web Crawling - Using regex module (Python)

# This script uses the re (regular expression) and urllib modules to crawl a list of sites for title tags.
# It simply provides the title tags, i.e., titles you would see on web browser tabs.
# By Rami Jaloudi, Programmer
import re
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

sites = 'cnn nytimes bloomberg'.split()

# re.I matches <TITLE> as well as <title>; re.S lets '.' span newlines,
# so a title that wraps across lines is still found.
pat = re.compile(r'<title>.+?</title>', re.I | re.S)

for s in sites:
    print('Searching: ' + s)
    u = urlopen('http://' + s + '.com')
    # read() returns bytes on Python 3, so decode before matching.
    text = u.read().decode('utf-8', 'replace')
    title = pat.findall(text)
    print(title)
    print('\n')
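
To try the title pattern without hitting the network, here is a minimal sketch that runs it against HTML strings held in memory. The helper name extract_title, the capture group, the [^>]* allowance for attributes on the opening tag, and the sample HTML are all assumptions made for illustration; they are not part of the script above.

```python
import re

# Variant of the script's pattern. The capture group (an assumption of this
# sketch) returns only the text inside the tag instead of the whole tag.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S)

def extract_title(html):
    """Return the first page title found in html, or None if there is none."""
    m = TITLE_RE.search(html)
    if m is None:
        return None
    # Collapse runs of whitespace, including newlines inside the tag.
    return ' '.join(m.group(1).split())

# Made-up sample page with an upper-case tag and a title that spans lines.
sample = "<html><head><TITLE>\n  Breaking News,\n  Latest News\n</TITLE></head></html>"

print(extract_title(sample))
print(extract_title("<html><body>no title here</body></html>"))
```

Returning None when no title is found keeps the "missing tag" case easy to tell apart from an empty title. For messy real-world pages, a proper HTML parser is usually more dependable than a regex, but for pulling out a single tag this is often enough.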
