python - Extract JavaScript from an HTML document using a regular expression

September 15, 2013

I am trying to extract the JavaScript from google.com using a regular expression.

Program:

    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
    print scriptlis

Output:

['']

Can anyone tell me how to extract the JavaScript from an HTML document using only a regular expression?

This works:

    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
    print scriptlis

The key here is (?si). The "s" sets the "dotall" flag (the same as re.DOTALL), which makes "." match newlines as well as every other character. That is the root of your problem: the scripts on google.com span multiple lines, so the regex can't match them unless you tell it to include newlines in (.*?).
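A minimal sketch of the difference the "s" flag makes. It uses Python 3 syntax and a small made-up HTML string rather than the real google.com page; the sample markup and the variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample: the first script fits on one line, the second spans several.
html = "<script>var a = 1;</script>\n<script>\nvar b = 2;\n</script>"

# Without DOTALL, "." stops at a newline, so only the one-line script matches.
without_dotall = re.findall(r'<script>(.*?)</script>', html)

# With (?s), "." also matches newlines, so both scripts match.
with_dotall = re.findall(r'(?s)<script>(.*?)</script>', html)

print(without_dotall)  # ['var a = 1;']
print(with_dotall)     # ['var a = 1;', '\nvar b = 2;\n']
```

The multi-line script is only captured once "." is allowed to cross line breaks.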

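The "i" sets the "ignorecase" flag (the same as re.IGNORECASE), which lets the pattern match the script tags however they are capitalized. It isn't strictly necessary here, because Google's markup is written consistently. But if you had poorly written markup that used something like <SCRIPT>...</SCRIPT>, you would need this flag.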
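A short sketch of what the "i" flag changes, again in Python 3 and on a made-up string (the upper-case sample tag and the variable names are assumptions, not taken from google.com). It also shows that the inline (?si) prefix behaves the same as passing re.DOTALL | re.IGNORECASE as the flags argument.

```python
import re

# Hypothetical sample with an upper-case tag pair, as sloppier markup might use.
html = "<SCRIPT>alert('hi');</SCRIPT>"

# Case-sensitive pattern: the lower-case "script" does not match "SCRIPT".
case_sensitive = re.findall(r'(?s)<script>(.*?)</script>', html)

# Adding "i" makes the tag names match regardless of case.
case_insensitive = re.findall(r'(?si)<script>(.*?)</script>', html)

# Same flags, passed explicitly instead of inline.
explicit_flags = re.findall(r'<script>(.*?)</script>', html,
                            re.DOTALL | re.IGNORECASE)

print(case_sensitive)    # []
print(case_insensitive)  # ["alert('hi');"]
print(explicit_flags)    # ["alert('hi');"]
```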
