javascript - Of scraping data, headless browsers, and Python
So I'm a CS student trying to learn web scraping and the do's and don'ts that come along with it. After messing with iMacros and a few other data scraping 'tools', I turned to Python, a language I was not familiar with at the time. I learned about BeautifulSoup and urllib2, and blundered my way through learning them via Stack Overflow and a few other forums.
Now, using the knowledge I've gained so far, I can scrape static web pages. However, I know that the era of static pages is over, as JS now reigns supreme even on mediocre websites.
I would like someone to please guide me in the right direction here. I want to learn a method to load JavaScript-laden webpages, load all the content, and somehow get that data to the BeautifulSoup function. urllib2 sucks at that. I would also like the ability to fill in forms and navigate through button clicks.
Mostly, the websites I'm interested in consist of a long list of results that load as you scroll down. Loading them and then downloading the page doesn't seem to help (I don't know why that is). I'm using Windows 7 and have Python 2.7.5 installed.
I've been told that headless browsers such as Zombie or Ghost would help me, but I don't know much about those. I tried using libraries such as mechanize, but they don't cater to what I need, i.e., loading the results, fetching the webpage, and feeding it to BS4.
Bearing in mind my minimal knowledge of Python, could anyone help me out here?
thanks
Selenium WebDriver with PhantomJS can do headless automated browsing of JavaScript-driven webpages. Once installed, it can be used like this:
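    import contextlib
    import selenium.webdriver as webdriver
    import bs4 as bs

    # define the path to the phantomjs binary
    phantomjs = 'phantomjs'
    url = ...
    # closing() makes sure the browser is shut down when the block exits
    with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
        driver.get(url)
        # page_source holds the DOM after JavaScript has run
        content = driver.page_source
        soup = bs.BeautifulSoup(content)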
On Ubuntu, it can be installed with:
    sudo pip install -U selenium
- Download and unpack PhantomJS
- Link or move the phantomjs binary to a directory in your PATH
    % cd phantomjs-1.9.0-linux-i686/bin/
    % ln phantomjs ~/bin
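For the long result lists that load as you scroll down, one possible approach is to keep scrolling to the bottom until the page height stops growing, and only then read driver.page_source. The following is a rough sketch, not a tested recipe for any particular site: the helper name scroll_until_stable and its pause and max_rounds parameters are made up for illustration, and it assumes the page appends results to document.body when scrolled. It relies only on the driver's execute_script method.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=30):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is assumed to be a Selenium WebDriver (any object with an
    execute_script method will do). Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom so the page's own JavaScript loads more results.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # crude wait; tune it for the site in question
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the height did not change, so nothing new was appended
        last_height = new_height
    return rounds
```

In the snippet above, this would be called as scroll_until_stable(driver) right after driver.get(url) and before reading driver.page_source. The max_rounds cap is there so that a truly endless feed cannot keep the loop running forever.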