Wednesday 1 October 2014

How to extract a list of URLs from a web page using Python

This post shows how to extract the URLs from a web page with Python code, which is handy when you are looking for download links.
Here I am using "sgmllib", a module built into Python 2, to find the URLs.

Use the code below and run it with any URL.


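__author__ = "Ashish jain (example@gmail.com)"
__version__ = "$Revision: 1.0 $"
__date__ = "$Date: 2014/10/01 21:57:19 $"
__license__ = "Python"

# Python 2 only: sgmllib was removed from the standard library in Python 3.
import urllib
from sgmllib import SGMLParser


class URLLister(SGMLParser):
  """Collect the href value of every <a> tag fed to the parser."""

  def reset(self):
    # SGMLParser calls reset() when it is constructed, so self.urls starts empty.
    SGMLParser.reset(self)
    self.urls = []

  def start_a(self, attrs):
    # attrs is a list of (name, value) pairs for the <a> tag.
    href = [v for k, v in attrs if k == 'href']
    if href:
      self.urls.extend(href)


if __name__ == "__main__":
  # Download the page, feed it to the parser, then print each link found.
  usock = urllib.urlopen("http://diveintopython.net/")
  parser = URLLister()
  parser.feed(usock.read())
  parser.close()
  usock.close()
  for url in parser.urls:
    print url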
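
The code above runs only on Python 2, because "sgmllib" is not available in Python 3. For readers on Python 3, here is a minimal sketch of the same idea using the standard "html.parser" module instead. This is a swapped-in technique, not the original method: the helper name extract_urls and the sample HTML string are assumptions made for illustration, and the sample is parsed from a string rather than downloaded.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def extract_urls(html_text):
    # Hypothetical helper: parse an HTML string and return its link targets.
    parser = URLLister()
    parser.feed(html_text)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    # Illustrative sample page; replace with HTML you have fetched yourself.
    sample = '<p><a href="http://example.com/a">A</a> <a name="x">no link</a> <a href="/b">B</a></p>'
    for url in extract_urls(sample):
        print(url)
```

To run this against a live page, the HTML could be fetched with urllib.request.urlopen and decoded to text before being passed to extract_urls. Links are returned exactly as written in the page, so relative links such as "/b" stay relative.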

Thank you guys for your support.
