python - Extract JavaScript from an HTML document using a regular expression


I am trying to extract the JavaScript from google.com using a regular expression.

Program:

    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
    print scriptlis

Output:

[''] 

Can anyone tell me how to extract JavaScript from an HTML document using regular expressions only?

This works:

    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
    print scriptlis

The key here is (?si). The "s" sets the "dotall" flag (the same as re.DOTALL), which makes "." match newlines as well. This is the root of the problem: the scripts on google.com span multiple lines, so the regex can't match them unless you tell it to include newlines in (.*?).
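A minimal sketch of the difference the "s" flag makes. To keep it self-contained it uses a small hypothetical HTML string (the name html and its contents are assumptions for illustration) instead of fetching google.com:

```python
import re

# Hypothetical sample page: the script body spans several lines.
html = "<html><script>\nvar a = 1;\nvar b = 2;\n</script></html>"

pattern = r'<script>(.*?)</script>'

# Without DOTALL, "." stops at the first newline, so nothing matches.
without_dotall = re.findall(pattern, html)

# With DOTALL, "." also matches newlines and the whole body is captured.
with_dotall = re.findall(pattern, html, re.DOTALL)

# The inline form (?s) is equivalent to passing re.DOTALL.
inline_flag = re.findall('(?s)' + pattern, html)

print(without_dotall)
print(with_dotall)
print(inline_flag)
```

Passing re.DOTALL as an argument and writing (?s) at the start of the pattern behave the same way; the inline form is just more compact.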

The "i" sets the "ignorecase" flag (the same as re.IGNORECASE), which lets the pattern match the tags regardless of letter case. This isn't strictly necessary here because Google codes pretty well. But if you had poor code that did something like <SCRIPT>...</SCRIPT>, you would need this flag.
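A short sketch of the effect of the "i" flag, again on a hypothetical inline string (html and its contents are made up for illustration) rather than a live page:

```python
import re

# Hypothetical markup mixing upper-case and lower-case script tags.
html = "<SCRIPT>alert(1);</SCRIPT><script>alert(2);</script>"

# Case-sensitive: only the lower-case tag pair is found.
case_sensitive = re.findall(r'(?s)<script>(.*?)</script>', html)

# Case-insensitive: both tag pairs are found.
case_insensitive = re.findall(r'(?si)<script>(.*?)</script>', html)

print(case_sensitive)
print(case_insensitive)
```

Note that this pattern still only matches a bare <script> opening tag; a tag carrying attributes would need a different pattern.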

