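Python string find function does not return the position in text returned by BeautifulSoup
I am trying to scrape part of a 10-K filing. I have trouble locating 'item 7(a)' in the text returned by BeautifulSoup, even though the text contains those words. The same code does work on a string I made up that contains 'item 7(a)'.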
import urllib2
import re
import bs4 as bs
url = 'https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm'
html = urllib2.urlopen(url).read().decode('utf8')
soup = bs.BeautifulSoup(html,'lxml')
text = soup.get_text()
text = text.encode('utf-8')
text = text.lower()
print type(text)
print len(text)
text1 = "hf dfbd item 7. abcd sfjsdf sdbfjkds item 7(a). adfbdf item 8. skjfbdk item 7. sdfkba ootgf sffdfd item 7(a). sfbdskf sfdf item 8. sdfbksdf "
print text.find('item 7(a)')
print text1.find('item 7(a)')
Output:
<type 'str'>
592214
-1
37
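Answer:
In the page, the text is ITEM&nbsp;7(A), written with &nbsp; (Non-Breaking SPace), which uses the character with code 160 instead of a normal space (code 32).
You can replace every character with code 160 (chr(160)) with a normal space (" ").
In Python 2, after encoding to UTF-8, you have to replace two characters instead: 194 and 160.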
text = text.replace(chr(160), " ")  # Python 3
text = text.replace(chr(194)+chr(160), " ")  # Python 2
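To make the effect visible without downloading the filing, here is a minimal Python 3 sketch. The sample string is made up for illustration and is not taken from the 10-K; it only mimics the "item&nbsp;7(a)" pattern described above.

```python
# Python 3 sketch; `sample` is a hypothetical string, not real filing text.
NBSP = "\u00a0"  # the same character as chr(160)

sample = "part ii item" + NBSP + "7(a). quantitative disclosures"

# The needle contains a normal space (code 32), so nothing matches.
before = sample.find("item 7(a)")

# After replacing the non-breaking space, the same search succeeds.
cleaned = sample.replace(NBSP, " ")
after = cleaned.find("item 7(a)")

# Encoded as UTF-8, the non-breaking space becomes the two bytes 194 and 160,
# which is why the Python 2 version has to replace two characters.
nbsp_bytes = list(NBSP.encode("utf-8"))

print(before, after)        # -1 8
print(nbsp_bytes)           # [194, 160]
print(repr(sample[8:13]))   # repr() shows the hidden character as \xa0
```

Printing with repr() is a quick way to spot such characters, because a plain print shows them as an ordinary-looking space.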
Full example:
#import urllib.request as urllib2  # Python 3
import urllib2  # Python 2
import re
import bs4 as bs
url = 'https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm'
html = urllib2.urlopen(url).read().decode('utf8')
soup = bs.BeautifulSoup(html, 'lxml')
text = soup.get_text()
text = text.encode('utf-8')  # only Python 2
text = text.lower()
#text = text.replace(chr(160), " ")  # Python 3
text = text.replace(chr(194)+chr(160), " ")  # Python 2
search = 'item 7(a)'
# find every occurrence in text
pos = 0
while True:
    pos = text.find(search, pos)
    if pos == -1:
        break
    #print(pos, ">"+text[pos-1]+"<", ord(text[pos-1]))
    print(text[pos:pos+20])
    pos += 1
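The search loop above can be wrapped in a small helper, together with a check that reports unusual whitespace. This is only a sketch for Python 3: the names find_all and odd_whitespace and the sample string are hypothetical and do not come from the original code.

```python
def find_all(text, needle):
    """Return the start index of every match of needle in text (overlaps allowed)."""
    positions = []
    pos = text.find(needle)
    while pos != -1:
        positions.append(pos)
        pos = text.find(needle, pos + 1)
    return positions


def odd_whitespace(text):
    """Return (index, code point) for whitespace other than space, tab, CR, LF."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if ch.isspace() and ch not in " \t\r\n"]


# Made-up sample: the first "item 7(a)" uses a non-breaking space.
sample = "item\u00a07(a). x item 7(a). y"

print(odd_whitespace(sample))                           # [(4, 160)]
print(find_all(sample, "item 7(a)"))                    # [13]
print(find_all(sample.replace("\u00a0", " "), "item 7(a)"))  # [0, 13]
```

Running odd_whitespace on the scraped text first shows which character codes need replacing before the search is attempted.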
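EDIT: tested only with Python 3.
You can also search for the string 'item&nbsp;7(a)' directly, after converting the entity with unescape.
But you have to remember to use &nbsp; instead of " " in this place.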
from html import unescape
search = unescape('item&nbsp;7(a)')
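A short Python 3 check of what unescape produces for this search string. The variable names are only for illustration.

```python
from html import unescape

# "&nbsp;" is converted to the single character with code 160.
needle = unescape("item&nbsp;7(a)")

print(repr(needle))      # 'item\xa07(a)'
print(ord(needle[4]))    # 160
print(needle == "item 7(a)")  # False: the needle holds a non-breaking space
```

This is why the unescaped needle matches the scraped text directly, while a needle typed with a normal space does not.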
Full code:
import urllib.request as urllib2  # Python 3
import re
import bs4 as bs
from html import unescape  # Python 3 only
url = 'https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm'
html = urllib2.urlopen(url).read().decode('utf8')
soup = bs.BeautifulSoup(html, 'lxml')
text = soup.get_text()
text = text.lower()
# the search string contains a non-breaking space, like the page text
search = unescape('item&nbsp;7(a)')
# find every occurrence in text
pos = 0
while True:
    pos = text.find(search, pos)
    if pos == -1:
        break
    #print(pos, ">"+text[pos-1]+"<", ord(text[pos-1]))
    print(text[pos:pos+20])
    pos += 1
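As an alternative that is not part of the answer above, a regular expression can match either kind of space without changing the text. In Python 3, \s in a str pattern also matches the non-breaking space. The sample string below is made up for illustration.

```python
import re

# Hypothetical sample: one match uses a non-breaking space, one a normal space.
sample = "see item\u00a07(a). and item 7(a). again"

# \s+ matches normal spaces and, for str patterns in Python 3, chr(160) too.
pattern = re.compile(r"item\s+7\(a\)")
starts = [m.start() for m in pattern.finditer(sample)]

print(starts)  # [4, 19]
```

The trade-off is that \s+ also matches tabs and line breaks, which may or may not be wanted when locating section headings.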