Python Crawlers: June 2015

Thursday, June 11, 2015

Web Crawling - Using regex module (Python)

Depth & Scrape Website for HTML Code & Create csv file containing Links

# This script uses the re and urllib modules to crawl a list of sites for title tags.
# It prints each page's title tag, i.e., the title you would see on a web browser tab.
# By Rami Jaloudi, Programmer
import re

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

sites = 'cnn nytimes bloomberg'.split()

# Non-greedy match; re.S lets a title span more than one line.
pat = re.compile(r'<title>.+?</title>', re.I | re.S)

for s in sites:
    print('Searching: ' + s)
    u = urlopen('http://' + s + '.com')
    # read() returns bytes on Python 3, so decode before matching.
    text = u.read().decode('utf-8', 'replace')
    title = pat.findall(text)
    print(title)
    print('\n')
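
The pattern above returns the whole tag, and it will miss a title tag that carries attributes. As a minimal sketch (Python 3 assumed; `extract_titles`, `TITLE_RE`, and the sample HTML are hypothetical names and data, not part of the original script), a capture group can pull out just the text between the tags, so you can try the regex without hitting any site:

```python
import re

# Hypothetical helper for illustration; not part of the original script.
# [^>]* tolerates attributes on the opening tag; (.*?) captures only the title text.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S)

def extract_titles(html):
    """Return the text inside each title element, with whitespace collapsed."""
    return [' '.join(match.split()) for match in TITLE_RE.findall(html)]

# Made-up sample page, used here instead of a live request.
sample = '<html><head><TITLE>\n  Example   Domain\n</TITLE></head><body></body></html>'
print(extract_titles(sample))
```

In the crawl loop, the decoded `text` could be passed to a helper like this in place of the direct `findall` call. A regex is only a rough tool for HTML, so treat this as a quick check rather than a full parser.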

Web Crawling - Using urllib module (Python)

# Web Crawling - Using urllib module (Python)
# By Rami Jaloudi, Programmer
# This script was written for Python 2.7.
# In Python 3, the urllib module has been split into parts and renamed as follows:
# urllib.request, urllib.parse, and urllib.error
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

url = "http://www.________.com" # Need to enter URL

response = urlopen(url)  # open the URL

html = response.read()  # raw content of the page

print(html)
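
The script above stops with a traceback if the URL is malformed or the site cannot be reached. Here is a rough sketch of one way to guard against that, assuming Python 3 only; the `fetch` function name and the 10-second default timeout are my own choices for illustration, not something from the original script:

```python
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the page at `url` as text, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            # Use the charset the server declares, falling back to UTF-8.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, 'replace')
    except (OSError, ValueError) as err:
        # URLError and HTTPError are subclasses of OSError;
        # ValueError is raised for strings that are not valid URLs.
        print('Could not fetch ' + url + ': ' + str(err))
        return None
```

With a helper like this, a failed request yields `None` instead of an exception, so a crawl over several sites can skip the bad one and carry on.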