Scrape using Beautiful Soup preserving &nbsp; entities
I would like to scrape a table from the web and keep the &nbsp; entities intact so that I can republish it as HTML later. BeautifulSoup seems to convert them to spaces, though. Example:
from bs4 import BeautifulSoup
html = "<html><body><table><tr>"
html += "<td>&nbsp;hello&nbsp;</td>"
html += "</tr></table></body></html>"
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all('table')[0]
row = table.find_all('tr')[0]
cell = row.find_all('td')[0]
print(cell)
observed result:
<td> hello </td>
required result:
<td>&nbsp;hello&nbsp;</td>
In bs4, the convertEntities parameter to the BeautifulSoup constructor is no longer supported. HTML entities are always converted into the corresponding Unicode characters (see docs).
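To make that conversion concrete, here is a minimal sketch that uses only the standard library, not bs4. It shows that &nbsp; decodes to the single character U+00A0 (which prints like an ordinary space) and that it can be mapped back to a named entity. The helper name to_named_entities is hypothetical and only for illustration.

```python
import html
from html.entities import codepoint2name

# Decoding the entity yields U+00A0 (no-break space), not an ASCII space.
text = html.unescape("&nbsp;hello&nbsp;")
print(repr(text))


def to_named_entities(s):
    """Replace every character that has an HTML 4 named entity with that entity.

    Hypothetical helper: note that it also rewrites <, >, & and ",
    so it is only suitable for text content, not for whole markup.
    """
    return "".join(
        "&%s;" % codepoint2name[ord(ch)] if ord(ch) in codepoint2name else ch
        for ch in s
    )


print(to_named_entities(text))
```

So the entity is not really lost during parsing; it is stored as a Unicode character and has to be re-encoded on output.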
According to the docs, you need to use an output formatter, like this:
print(soup.find_all('td')[0].prettify(formatter="html"))
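Note that prettify() also re-indents the markup and adds newlines. If the cell should come back without the extra whitespace, one option is to pass the same formatter to decode() instead. A minimal sketch, assuming Python 3, a reasonably recent bs4, and the built-in "html.parser" backend (the variable names are only illustrative):

```python
from bs4 import BeautifulSoup

html = "<html><body><table><tr><td>&nbsp;hello&nbsp;</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Default output: the no-break space is kept as the raw character U+00A0.
default_markup = str(cell)

# formatter="html" asks bs4 to write named HTML entities where it can.
entity_markup = cell.decode(formatter="html")

print(repr(default_markup))
print(entity_markup)
```

The second print should show the cell with its &nbsp; entities restored, ready to be republished as HTML.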