javascript - Of scraping data, headless browsers, and Python -


So I'm a CS student trying to learn web scraping, and all the do's and don'ts that come along with it. After messing with iMacros and a few other data scraping 'tools', I turned to Python, a language I was not familiar with at the time. I learned about BeautifulSoup and urllib2, and blundered my way through learning them with the help of Stack Overflow and a few other forums.

Now, using the knowledge I've gained so far, I can scrape most static web pages. However, we all know that the era of static pages is over, as JS reigns supreme on even mediocre websites now.

I would like someone to please guide me in the right direction here. I want to learn a method to load JavaScript-laden web pages, load all the content, and somehow get this data into the BeautifulSoup function. urllib2 sucks at that. I would also like the ability to fill in forms and navigate through button clicks.

Mostly, the websites I'm interested in consist of a long list of results that load as you scroll down. Loading all of them and then downloading the page doesn't seem to help (I don't know why that is). I'm using Windows 7, and have Python 2.7.5 installed.

I've been told that headless browsers such as Zombie or Ghost would help me, but I really don't know much about those. I tried using libraries such as mechanize, but they don't really cater to what I need, i.e. loading the results, fetching the web page, and feeding it into BS4.

Bearing in mind my minimal knowledge of Python, could anyone help me out here?

Thanks

Selenium WebDriver with PhantomJS can do headless automated browsing of JavaScript-driven web pages. Once installed, it can be used like this:

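    import contextlib
    import selenium.webdriver as webdriver
    import bs4 as bs

    # define the path to the phantomjs binary
    phantomjs = 'phantomjs'
    url = ...
    # contextlib.closing makes sure the browser is closed when the block exits
    with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
        driver.get(url)
        # page_source holds the HTML after the page's JavaScript has run
        content = driver.page_source
        soup = bs.BeautifulSoup(content)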
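For the lists of results that load as you scroll down, mentioned in the question, one possible approach is to keep scrolling to the bottom until the page height stops growing, and only then read `driver.page_source`. The following is a minimal sketch, not a definitive implementation: the function name `scroll_until_stable` and its `pause` and `max_rounds` parameters are made up for illustration, and it assumes the page grows `document.body.scrollHeight` when new results arrive. The only driver method it relies on is Selenium's `execute_script`.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's execute_script method.
    `pause` and `max_rounds` are illustrative knobs, not Selenium settings.
    Returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom of the page to trigger loading of more results.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        # Give the page's JavaScript time to fetch and insert new results.
        time.sleep(pause)
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # the page did not grow, so assume everything is loaded
        last_height = new_height
    return rounds
```

Calling `scroll_until_stable(driver)` right after `driver.get(url)` and before reading `driver.page_source` should give BeautifulSoup the fully expanded list, provided the site loads results on scroll rather than behind a "load more" button. The `max_rounds` cap is there so that an endless feed cannot keep the loop running forever.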

On Ubuntu, it can be installed with

  • sudo pip install -U selenium
  • Download and unpack phantomjs
  • Link or move the phantomjs binary to a directory in your PATH

    % cd phantomjs-1.9.0-linux-i686/bin/
    % ln phantomjs ~/bin
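To confirm that the binary really ended up in a directory on your PATH before handing its name to Selenium, a quick check along these lines may help. It is a rough sketch for POSIX systems such as Ubuntu: `find_on_path` is a hypothetical helper written for this example, not part of Selenium or PhantomJS, and it ignores Windows details such as `.exe` suffixes.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None.

    `path` is a PATH-style string; it defaults to the PATH environment
    variable. Hypothetical helper for illustration only.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

If `find_on_path('phantomjs')` returns `None`, the link step above probably did not take effect, or `~/bin` is not on your PATH.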
