javascript - Of scraping data, headless browsers, and Python -

September 15, 2014

So I'm a CS student trying to learn web scraping and all the do's and don'ts that come along with it. After messing with iMacros and a few other data scraping 'tools', I turned to Python, a language I was not familiar with at the time. I learned about BeautifulSoup and urllib2, and blundered my way through learning them via Stack Overflow and a few other forums.

Now, using the knowledge I've gained so far, I can scrape most static web pages. However, we all know that the era of static pages is over, as JS reigns supreme on even mediocre websites now.

I would like someone to please guide me in the right direction here. I want to learn a method to load JavaScript-laden webpages, load all the content, and somehow get that data to the BeautifulSoup function. urllib2 sucks at that. I would also like the ability to fill in forms and navigate through button clicks.

Mostly the websites I'm interested in consist of a long list of results that load as you scroll down. Loading them all and then downloading the page doesn't seem to help (I don't know why that is). I'm using Windows 7, and have Python 2.7.5 installed.

I've been told that headless browsers such as Zombie or Ghost would help me, but I don't know much about those. I tried using libraries such as mechanize, but they don't cater to what I need, i.e., loading the results, fetching the webpage, and feeding it to bs4.

Bearing in mind my minimal knowledge of Python, could anyone help me out here?

Thanks

Selenium WebDriver with PhantomJS can do headless, automated browsing of JavaScript-driven webpages. Once installed, it can be used like this:

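    import contextlib
    import selenium.webdriver as webdriver
    import bs4 as bs

    # Define the path to the phantomjs binary
    # (just the name is enough if it is on your PATH)
    phantomjs = 'phantomjs'
    url = ...

    # contextlib.closing makes sure the browser is shut down afterwards
    with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
        driver.get(url)
        # page_source is the HTML after the page's JavaScript has run
        content = driver.page_source
        soup = bs.BeautifulSoup(content, 'html.parser')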
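For the long result lists that load as you scroll down, one common approach is to have the driver scroll to the bottom repeatedly until the page height stops growing, and only then read the page source. The following is a minimal sketch of that idea, not part of the original answer. The function name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical names chosen for illustration; the only WebDriver call it relies on is `execute_script`. It also assumes the site grows `document.body.scrollHeight` when new results arrive, which may not hold for every page.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=30):
    """Scroll to the bottom until the page height stops growing.

    `driver` is assumed to be a Selenium WebDriver (anything with an
    `execute_script` method works). Returns how many scrolls loaded
    new content.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        # Jump to the bottom so the page's JavaScript requests more results
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new results time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new appeared, assume the list is complete
        last_height = new_height
        loaded += 1
    return loaded
```

It would be called inside the `with` block, after `driver.get(url)` and before reading `driver.page_source`. A fixed `pause` is the simplest choice; slow sites may need a longer one.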

On Ubuntu, Selenium and PhantomJS can be installed as follows.

Install Selenium:

    sudo pip install -U selenium

Download and unpack PhantomJS.

Link or move the phantomjs binary to a directory in your PATH:

    % cd phantomjs-1.9.0-linux-i686/bin/
    % ln phantomjs ~/bin
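Once the binary is in place, a quick way to confirm that Python will be able to find it is to search the PATH directories by hand. This is only a sketch: `find_on_path` is a hypothetical helper, not something provided by Selenium, and it assumes a Unix-like system where the binary has no file extension (on Windows the name would need `.exe` appended).

```python
import os


def find_on_path(name):
    """Return the full path of an executable called `name`, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Must be a regular file that the current user may execute
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print(find_on_path("phantomjs"))
```

If this prints `None`, the directory holding the binary is probably not on your PATH, and passing just `'phantomjs'` to the driver is unlikely to work.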
