node.js - Parse broken HTML Code using Nodejs & Cheerio

May 15, 2015

I am trying to scrape tabular data from a purely static HTML page using Node.js and Cheerio. The problem is that the page I am trying to scrape doesn't have a proper HTML DOM. I mean, there are many opening tags that are never closed, and there are other closing tags (such as </table>) that have no opening tag.

Sample code (alert: the code is close to the real sample, and the HTML is broken):

    <body topmargin="0" leftmargin="0" marginheight="0" marginwidth="0" bgcolor="#ffffff" text="#000000" link="#003399" vlink="#003399" alink="#ff8000">
        <table border="0" cellpadding="0" cellspacing="0" width="100%">
            <tr><td bgcolor="#445bc6">hii</td></tr>
            <tr><td></td></tr>
            <tr>
                <td align="right" bgcolor="#d9d9e8" width="100%">
                    <p class="menu"><b><font color="#000000"><a href="details.php?type=contact&npo_id=18430">individuals</a></font></b>&nbsp;&nbsp;
                </td>
            </tr>
        </table>
        <p>
        <table cellpadding=8><tr><td>&nbsp;</td><td>
                    <table cellpadding=8 style="border-collapse: collapse" border=1 width=80% align=cemter>
                        <tr><td bgcolor="d8d8c4" valign=top align=right><p><b>data 1</b></td>
                            <td><p><b>data 2</b></td>
                        </tr>
                        <tr><td bgcolor="d8d8c4" valign=top align=right><p><b>data 3</b></td>
                            <td><p>data 4</td>
                        </tr>
                    </table>
                </td></tr></table>
        <tr>
        <td width="100%" valign="bottom" colspan="2" align="center">
            <p>
                <a href="#top">another dirty content</a><br>
                <a href="#top"><font color="#000000">table wrong</font></a></p>
        </td>
    </tr></table></div>

As one can see, there are <p> tags that are never closed, and at the bottom there are </table> and </div> tags that were never opened. How can I fetch data 1, data 2, data 3 and data 4 using Cheerio and Node.js? Or is there another library that is efficient at parsing such data?

Edit (solution): The problem is solved. What I did was convert the HTML tags to lower case, and it worked fine. I am not sure why lower case is important, but it worked with Cheerio.

Cheerio is built around htmlparser2, which is supposed to be "forgiving". If it doesn't parse your page, and I know this goes against conventional wisdom, you could parse it using regular expressions. I'm assuming the page structure won't change much, and that it's only one page you are trying to parse.
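As a rough illustration of the regular-expression fallback, the sketch below first isolates the data table and then pulls out each cell. It assumes the data table can be recognised by the `border-collapse` style seen in the sample and that every data cell has a closing </td>. The names `brokenPage` and `extractCellsWithRegex` are hypothetical, and this approach is only reasonable while the page structure stays fixed.

```javascript
// Shortened, hypothetical stand-in for the broken page in the question.
const brokenPage = `
<table cellpadding=8><tr><td>&nbsp;</td><td>
  <table cellpadding=8 style="border-collapse: collapse" border=1>
    <tr><td align=right><p><b>data 1</b></td>
        <td><p><b>data 2</b></td>
    </tr>
    <tr><td align=right><p><b>data 3</b></td>
        <td><p>data 4</td>
    </tr>
  </table>
</td></tr></table>
<tr><td><p><a href="#top">another dirty content</a></p></td></tr></table></div>`;

// Illustrative helper: regex-only extraction, no HTML parser involved.
function extractCellsWithRegex(html) {
  // Step 1: grab the body of the table whose opening tag mentions
  // "border-collapse", up to the first </table> that follows it.
  const tableMatch =
    /<table\b[^>]*border-collapse[^>]*>([\s\S]*?)<\/table>/i.exec(html);
  if (!tableMatch) return [];

  // Step 2: capture everything between each <td ...> and the next </td>.
  const cellPattern = /<td\b[^>]*>([\s\S]*?)<\/td>/gi;

  // Step 3: strip any remaining tags (<p>, <b>, ...) and trim whitespace.
  return Array.from(tableMatch[1].matchAll(cellPattern), (m) =>
    m[1].replace(/<[^>]*>/g, '').trim()
  );
}

console.log(extractCellsWithRegex(brokenPage));
```

Isolating the table first matters here because the outer layout cell wraps the whole data table, so matching <td> over the full page would pair the wrong opening and closing tags.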

Also, I noticed the link at the top of your sample HTML, individuals.php. Is the data you are after available there, perhaps in a different, more parseable format?

Oh, and respect people's privacy, and the site's usage terms, when scraping.
