python - how to extract all the data between -
<p align="justify"><a href="#abcd"> mr a </a></p>
<p align="justify">i </p>
<p align="justify"> have question </p>
<p align="justify"> </p>
<p align="justify"><a href="#mnop"> mr b </a></p>
<p align="justify">the </p>
<p align="justify">answer is</p>
<p align="justify">not there</p>
<p align="justify"> </p>
<p align="justify"><a href="wxyz"> mr c </a></p>
<p align="justify">please</p>
<p align="justify">help</p>
I want to extract this data iteratively, one person's block at a time:
- The first iteration should display "i have question".
- The second iteration should display "the answer is not there".
- The person names should be extracted into a separate list, for example ['mr a','mr b','mr c'].
If anyone has an idea how to do it, it would be useful, because I am trying to learn Python and got stuck on this problem. The code I tried is:
for t in soup.findAll('p', text=re.compile(' '), attrs={'align': 'justify'}):
    print t
    for item in t.parent.next_siblings:
        if isinstance(item, Tag):
            if 'p' in item.attrs and 'align' in item.attrs['p']:
                break
            print item
It returns [], which is not what I want.
Here is just one method, using regex:
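from re import split, sub

# Adjacent string literals are joined with no whitespace between them.
html = ('<p align="justify">i </p>'
        '<p align="justify"> have question </p>'
        '<p align="justify"> </p>'
        '<p align="justify">the </p>'
        '<p align="justify">answer is</p>'
        '<p align="justify">not there</p>'
        '<p align="justify"> </p>'
        '<p align="justify">please</p>'
        '<p align="justify">help</p>')

# Replace every tag with a space. The blank <p> separators then leave runs
# of five or more spaces, so split there and collapse the leftover whitespace.
text = sub(r"<.*?>", " ", html)
print([sub(r"\s+", " ", x).strip() for x in split(r" {5,}", text)])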
Output:
['i have question', 'the answer is not there', 'please help']
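The regex method does not collect the person names, and it depends on the exact spacing left behind by the blank paragraphs. As a rough alternative, here is a minimal sketch that uses the standard library's html.parser instead of the BeautifulSoup object from the question. The names BlockParser, names and messages are made up for this illustration, and it assumes that every name sits in its own <a> link, that each link starts a new block, and that the first link reads "mr a".

```python
from html.parser import HTMLParser


class BlockParser(HTMLParser):
    """Hypothetical helper: collect link texts as names and group the text after each link."""

    def __init__(self):
        super().__init__()
        self.names = []       # text of every <a> element
        self.blocks = []      # one list of text pieces per name
        self.in_link = False  # True while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.blocks.append([])  # a new name starts a new block

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return  # skip the blank separator paragraphs
        if self.in_link:
            self.names.append(text)
        elif self.blocks:
            self.blocks[-1].append(text)


# Sample markup from the question (first name assumed to be "mr a").
html = (
    '<p align="justify"><a href="#abcd"> mr a </a></p>'
    '<p align="justify">i </p>'
    '<p align="justify"> have question </p>'
    '<p align="justify"> </p>'
    '<p align="justify"><a href="#mnop"> mr b </a></p>'
    '<p align="justify">the </p>'
    '<p align="justify">answer is</p>'
    '<p align="justify">not there</p>'
    '<p align="justify"> </p>'
    '<p align="justify"><a href="wxyz"> mr c </a></p>'
    '<p align="justify">please</p>'
    '<p align="justify">help</p>'
)

parser = BlockParser()
parser.feed(html)
parser.close()

names = parser.names
messages = [" ".join(block) for block in parser.blocks]

print(names)
for message in messages:  # one message per iteration
    print(message)
```

Because the grouping is keyed on the links rather than on whitespace, this should keep working if the spacing inside the paragraphs changes, but it would need adjusting if a block could contain other links.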