Split a table into several tables with Beautiful Soup [Python]


I need help with a problem I can't figure out.

I have an HTML table made of <tr> and <td> elements.

For example:

<table border="0" cellpadding="0" cellspacing="0">
  <tr>
    <td></td>
  </tr>
  <tr>
    <td colspan="2">
      <br />
      <h2>macros</h2>
    </td>
  </tr>
  <tr>
    <td>#define&nbsp;</td>
    <td>
      <a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745">snd_lstindic</a>
    </td>
  </tr>
  <tr>
    <td class="mdescleft">&nbsp;</td>
    <td class="mdescright">liste sons indication<br /></td>
  </tr>
  <tr>
    <td colspan="2">
      <br />
      <h2>définition de type</h2>
    </td>
  </tr>
  <tr>
    <td class="memitemleft" nowrap="nowrap" align="right" valign="top">typedef void(*&nbsp;</td>
    <td class="memitemright" valign="bottom">
      <a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844">f_sndchangefunc</a>
      )(
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">e_sndsound</a>
      i_esound, abool i_bstart, abyte i_bydisablemodule)
    </td>
  </tr>
  <tr>
    <td class="mdescleft">&nbsp;</td>
    <td class="mdescright">
      fonction rappel sur départ/arrêt bip.
      <a href="#g73cba8bd62d629eb05495a5c1a7b2844"></a>
      <br />
    </td>
  </tr>
  <tr>
    <td colspan="2">
      <br />
      <h2>Énumérations</h2>
    </td>
  </tr>
  <tr>
    <td class="memitemleft" nowrap="nowrap" align="right" valign="top">enum &nbsp;</td>
    <td class="memitemright" valign="bottom">
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">e_sndsound</a>
      { }
    </td>
  </tr>
  <tr>
    <td class="mdescleft">&nbsp;</td>
    <td class="mdescright">
      identificateurs sons
      <a href="group__sound.html#g4ab7db37a42f244764583a63997489a8">plus de détails...</a>
      <br />
    </td>
  </tr>
</table>

I am trying to split this table into several tables: for each title, I want to pull the <h2> out of the table and create a new table from the rows that follow it.

For example, the expected result here should be this:

<h2>macros</h2>
<table border="0" cellpadding="0" cellspacing="0">
  <tr>
    <td></td>
  </tr>
  <tr>
    <td colspan="2">
      <br />
    </td>
  </tr>
  <tr>
    <td>#define&nbsp;</td>
    <td>
      <a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745">snd_lstindic</a>
    </td>
  </tr>
  <tr>
    <td class="mdescleft">&nbsp;</td>
    <td class="mdescright">liste sons indication<br /></td>
  </tr>
</table>

<h2>définition de type</h2>
<table>
  <tr>
    <td class="memitemleft" nowrap="nowrap" align="right" valign="top">typedef void(*&nbsp;</td>
    <td class="memitemright" valign="bottom">
      <a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844">f_sndchangefunc</a>
      )(
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">e_sndsound</a>
      i_esound, abool i_bstart, abyte i_bydisablemodule)
    </td>
  </tr>
  <tr>
    <td class="mdescleft">&nbsp;</td>
    <td class="mdescright">
      fonction rappel sur départ/arrêt bip.
      <a href="#g73cba8bd62d629eb05495a5c1a7b2844"></a>
      <br />
    </td>
  </tr>
</table>

<h2>Énumérations</h2>
<table>
  <tr>
    <td class="memitemleft" nowrap="nowrap" align="right" valign="top">enum &nbsp;</td>
    <td class="memitemright" valign="bottom">
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">e_sndsound</a>
      { }
    </td>
  </tr>
  <tr>
    <td class="mdescleft">&nbsp;</td>
    <td class="mdescright">
      identificateurs sons
      <a href="group__sound.html#g4ab7db37a42f244764583a63997489a8">plus de détails...</a>
      <br />
    </td>
  </tr>
</table>

I use Python and BeautifulSoup to parse the HTML code. This is what I tried first:

from BeautifulSoup import BeautifulSoup, NavigableString
import sys
import os

soup = BeautifulSoup(allhtml)  # allhtml holds the page source

for table in soup.findAll("table"):
   h2s = table.findAll("h2")
   if h2s:
      firsth2 = True
      lasth2 = False
      for i, h2 in enumerate(h2s):
         lasth2 = (i == len(h2s) - 1)

         h2.parent.replaceWithChildren() # <td> deleted
         h2.parent.replaceWithChildren() # <tr> deleted
         print h2.parent
         if firsth2:
            # the replacement is a plain string, not a tag
            h2.replaceWith(h2.prettify() + '<table>')
            #h2_tag_idx = h2.parent.contents.index(h2) # other method to add tags
            #h2.parent.insert(h2_tag_idx + 1, '<b>ok</b>')
         else:
            h2.replaceWith('</table>' + h2.prettify() + '<table>')

         firsth2 = False

print soup.prettify()

But it doesn't work: the tags I insert are replaced by their escaped HTML entity equivalents (for example &lt;table&gt;)...
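The escaping is what a parse tree normally does with a string: it is stored as a text node, so its angle brackets are written out as entities, whereas a real tag has to be created as an element node. A minimal sketch of that difference, using the standard library's xml.etree.ElementTree as a stand-in rather than BeautifulSoup itself (the variable names are only illustrative):

```python
import xml.etree.ElementTree as ET

# Case 1: markup passed as a string becomes text, so it is escaped on output.
cell_with_text = ET.Element("td")
cell_with_text.text = "<table>"
as_text = ET.tostring(cell_with_text, encoding="unicode")

# Case 2: markup created as an element node is serialized as a real tag.
cell_with_node = ET.Element("td")
ET.SubElement(cell_with_node, "table")
as_node = ET.tostring(cell_with_node, encoding="unicode")

print(as_text)
print(as_node)
```

The first output contains &lt;table&gt;, the second contains an actual table element. The same reasoning suggests why replacing the <h2> with a string that contains '<table>' cannot open a new table in the tree.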

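I also tried taking all the contents of the table, rebuilding several tables from them and putting those back into the soup, but that failed...

I tried converting the table to a string, splitting the string on a delimiter and putting the sub-tables back into the soup, but that failed too...

If anyone has an idea, that would be great!

Thanks in advance!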

I wrote this function, and it works...

from BeautifulSoup import BeautifulSoup, Tag

def getouttitlefromtable(htmlsoup):
   for ii, table in enumerate(htmlsoup.findAll("table")):
      h2s = table.findAll("h2") # find every <h2></h2> in the table
      #print h2s
      if len(h2s) > 0: # the table contains at least one <h2>
         firsth2 = True
         lasth2 = False
         newtables = BeautifulSoup() # will hold the rebuilt tables
         for i, h2 in enumerate(h2s):
            lasth2 = (i == len(h2s) - 1)
            h2.parent.replaceWithChildren() # remove the <td>
            h2.parent.replaceWithChildren() # remove the <tr>

            idt = "table"+str(ii)+str(i) # give each table an id for better readability
            wraptable = Tag(htmlsoup, "table")
            wraptable["id"]=idt
            wraptable["border"]=0
            wraptable["cellpadding"]=0
            wraptable["cellspacing"]=0
            #print h2.parent.contents.index(h2) # index of the h2 in the table tree
            table.insert(h2.parent.contents.index(h2)+1, wraptable) # add <table></table> after each <h2>"title"</h2>
            #newtable = table.findAll("table")
            newtable = table.find(name="table", attrs={"id" : idt})
            filltable = False
            #print table.findAll(["h2","tr"])
            for tr in table.findAll(["h2","tr"]):
               if filltable:
                  if tr in h2s:
                     #print "end of the new table"
                     #print tr
                     filltable = False
                     break
                  else:
                     if tr.find("h2") not in h2s:
                        #print "adding a new row: "
                        newtable.contents.append(tr)
                        #print newtable.contents

               if str(tr) == str(h2):
                  #print "start of the new table"
                  #print tr
                  filltable = True

            newtables.append(h2)
            newtables.append(newtable)

            #os.system("pause")

            #print h2
            #print firsth2
            #print lasth2
            firsth2 = False

         #print newtables
         table.contents = newtables
         table.name = "div" # rename the table tag to div... it's cheating, but I can't manage to remove the <table></table> wrapper
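The core of the function above is a grouping pass: walk the rows in order, start a new table at each <h2>, and attach the following rows to it. A minimal sketch of that pass on its own, written with the standard library's xml.etree.ElementTree instead of BeautifulSoup, so it assumes well-formed markup without entities such as &nbsp;. The name split_table_by_h2 and the sample markup are hypothetical, and rows that appear before the first title are simply skipped here:

```python
import xml.etree.ElementTree as ET

def split_table_by_h2(table):
    """Return a flat list: each title as an <h2>, followed by a <table> of its rows."""
    parts = []
    current = None  # table currently being filled, None until the first title
    for tr in table.findall("tr"):
        h2 = tr.find(".//h2")
        if h2 is not None:
            # A title row: emit a fresh <h2> and open a new table after it.
            title = ET.Element("h2")
            title.text = (h2.text or "").strip()
            current = ET.Element("table", dict(table.attrib))
            parts.extend([title, current])
        elif current is not None:
            # An ordinary row: it belongs to the most recent title.
            current.append(tr)
    return parts

# Simplified, hypothetical sample modelled on the table in the question.
sample = ET.fromstring(
    "<table border='0'>"
    "<tr><td colspan='2'><h2>macros</h2></td></tr>"
    "<tr><td>#define</td><td>snd_lstindic</td></tr>"
    "<tr><td colspan='2'><h2>enumerations</h2></td></tr>"
    "<tr><td>enum</td><td>e_sndsound</td></tr>"
    "</table>"
)
parts = split_table_by_h2(sample)
for part in parts:
    print(ET.tostring(part, encoding="unicode"))
```

Because the new tables are built as separate elements rather than inserted into the original one, there is no outer <table> wrapper left to remove or rename.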

If anyone has a better solution, I'd be glad to see it.

bye

