node.js - Parse broken HTML code using Node.js & Cheerio
I am trying to scrape tabular data from a purely static HTML page using Node.js and cheerio. The problem is that the page I am trying to scrape doesn't have a proper HTML DOM. I mean, there are many opening tags that are never closed, and there are closing tags (</table>) that have no matching opening tag.
Here is a sample of the code (warning: the code is close to the real sample, and the HTML is broken):
<body topmargin="0" leftmargin="0" marginheight="0" marginwidth="0" bgcolor="#ffffff" text="#000000" link="#003399" vlink="#003399" alink="#ff8000"> <table border="0" cellpadding="0" cellspacing="0" width="100%"> <tr><td bgcolor="#445bc6">hii</td></tr> <tr><td></td></tr> <tr> <td align="right" bgcolor="#d9d9e8" width="100%"> <p class="menu"><b><font color="#000000"><a href="details.php?type=contact&npo_id=18430">individuals</a></font></b> </td> </tr> </table> <p> <table cellpadding=8><tr><td> </td><td> <table cellpadding=8 style="border-collapse: collapse" border=1 width=80% align=cemter> <tr><td bgcolor="d8d8c4" valign=top align=right><p><b>data 1</b></td> <td><p><b>data 2</b></td> </tr> <tr><td bgcolor="d8d8c4" valign=top align=right><p><b>data 3</b></td> <td><p>data 4</td> </tr> </table> </td></tr></table> <tr> <td width="100%" valign="bottom" colspan="2" align="center"> <p> <a href="#top">another dirty content</a><br> <a href="#top"><font color="#000000">table wrong</font></a></p> </td> </tr></table></div>
As one can see, there are p tags that are never closed. At the bottom there are </table> and </div> tags that have no matching opening tags. How can I fetch data 1, data 2, data 3 and data 4 using cheerio and Node.js? Or is there another library that is efficient at parsing such data?
Edit (solution): The problem is solved. I converted all the HTML tags to lower case and it worked fine. I am not sure why lower case matters, but it worked with cheerio.
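The lower-casing step from the edit above might look roughly like the sketch below. This is not the asker's actual code: the helper name lowerCaseTagNames is hypothetical, and the regular expression only rewrites the tag name that directly follows "<" or "</", leaving attributes and text untouched. It would also rewrite anything that merely looks like a tag inside scripts or comments, so treat it as a rough pre-processing pass.

```javascript
// Hypothetical helper (not from the original post): lower-case every tag
// name in an HTML string before handing it to a parser.
function lowerCaseTagNames(html) {
  // Group 1 is the optional slash of a closing tag, group 2 is the tag name.
  return html.replace(
    /<(\/?)([A-Za-z][A-Za-z0-9]*)/g,
    (match, slash, name) => `<${slash}${name.toLowerCase()}`
  );
}

// Made-up input in the spirit of the sample page.
const fixed = lowerCaseTagNames('<TABLE><TR><TD BGCOLOR="d8d8c4">data 1</TD></TR></TABLE>');
console.log(fixed);
```

The resulting string could then be passed to cheerio.load() as usual. Why this helped in the asker's case is not established here.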
cheerio is built around htmlparser2, which is supposed to be "forgiving". If it doesn't parse your page then, and I know this goes against conventional wisdom, you could parse the page using regular expressions. That is assuming the page structure won't change much and it's only one page you are trying to parse.
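A minimal sketch of the regular-expression fallback, assuming the cells you want always appear as <td ...>...</td> pairs even though the tags inside them (such as p) are never closed. The function name extractCells and the trimmed-down sample string are illustrative assumptions, not part of the original page, and the pattern will break if the markup changes shape.

```javascript
// Illustrative sample: the inner data table from the question, trimmed down.
const html = `
<table cellpadding=8 border=1>
<tr><td bgcolor="d8d8c4" valign=top align=right><p><b>data 1</b></td>
<td><p><b>data 2</b></td>
</tr>
<tr><td bgcolor="d8d8c4" valign=top align=right><p><b>data 3</b></td>
<td><p>data 4</td>
</tr>
</table>`;

// Hypothetical helper: collect the visible text of every <td>...</td> cell.
function extractCells(source) {
  // Non-greedy match from an opening <td ...> to the nearest </td>.
  const cellPattern = /<td\b[^>]*>([\s\S]*?)<\/td>/gi;
  const cells = [];
  for (const match of source.matchAll(cellPattern)) {
    const text = match[1]
      .replace(/<[^>]+>/g, '') // drop any tags left inside the cell
      .replace(/\s+/g, ' ')    // collapse runs of whitespace
      .trim();
    if (text) cells.push(text); // skip empty spacer cells
  }
  return cells;
}

console.log(extractCells(html)); // [ 'data 1', 'data 2', 'data 3', 'data 4' ]
```

String.prototype.matchAll needs Node.js 12 or later. On the full page the same pattern would also pick up cells from the surrounding layout tables, so you would likely narrow the input to the relevant table first.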
Also, I noticed the link at the top of your sample HTML, individuals.php. Is the data you are after there, in a different, more parseable format?
Oh, and respect people's privacy, and the site's usage terms, when scraping.