Python string find function does not give the position in text returned by BeautifulSoup

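I am trying to scrape a section of a 10-K filing. I have a problem locating 'item 7(a)' in the text returned by BeautifulSoup, even though the text contains those words. The same code does work on a string I made up myself that contains 'item 7(a)'.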

import urllib2
import re
import bs4 as bs

url = 'https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm'

html = urllib2.urlopen(url).read().decode('utf8')
soup = bs.BeautifulSoup(html, 'lxml')

text = soup.get_text()
text = text.encode('utf-8')
text = text.lower()

print type(text)
print len(text)

text1 = "hf dfbd item 7. abcd sfjsdf sdbfjkds item 7(a). adfbdf item 8. skjfbdk item 7. sdfkba ootgf sffdfd item 7(a). sfbdskf sfdf item 8. sdfbksdf "

print text.find('item 7(a)')   # text from the page
print text1.find('item 7(a)')  # hand-made string

Output:

<type 'str'>
592214
-1
37

Answer:

In the text ITEM 7(A), the page uses the entity &nbsp; (Non-Breaking Space, char code 160) instead of a normal space (code 32).
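One way to confirm this is to print the code point of every character around the match. The sketch below assumes Python 3; `sample` is a made-up stand-in for the text returned by `soup.get_text()`, and `describe` is a hypothetical helper, not part of BeautifulSoup.

```python
import unicodedata

# Made-up stand-in for the page text; "\u00a0" is what &nbsp; becomes after parsing.
sample = "item\u00a07(a)"

def describe(s):
    """Return (character, code point, Unicode name) for every character in s."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in s]

for ch, code, name in describe(sample):
    print(repr(ch), code, name)

# The fifth character looks like a space on screen but is a different character.
looks_like_space = describe(sample)[4]
```

Printing `repr()` of a slice of the real text around the expected position gives the same kind of hint, because the non-breaking space shows up as `\xa0`.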

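You can replace every character with code 160 (chr(160)) with a normal space (" ").
In Python 2 (after encoding to UTF-8) you have to replace two characters instead, with codes 194 and 160.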

text = text.replace(chr(160), " ")             # Python 3 (text is str)
text = text.replace(chr(194) + chr(160), " ")  # Python 2 (text is UTF-8 encoded)
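The two-character replacement follows from how UTF-8 encodes the non-breaking space: U+00A0 becomes the bytes 194 and 160. A minimal Python 3 sketch of both cases, using a made-up `sample` string rather than the real filing:

```python
NBSP = "\u00a0"  # non-breaking space, code point 160

# Made-up sample; the real text comes from soup.get_text().
sample = "part ii item" + NBSP + "7(a). quantitative disclosures"

# As a str, the non-breaking space is a single character.
before = sample.find("item 7(a)")
after = sample.replace(NBSP, " ").find("item 7(a)")

# Once encoded to UTF-8, the same character becomes two bytes.
encoded = sample.encode("utf-8")
nbsp_bytes = list(NBSP.encode("utf-8"))
after_bytes = encoded.replace(b"\xc2\xa0", b" ").find(b"item 7(a)")

print(before, after, nbsp_bytes, after_bytes)
```

Skipping the `encode('utf-8')` step and working on the decoded text avoids the two-byte case altogether.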

Full example

#import urllib.request as urllib2  # Python 3
import urllib2
import re
import bs4 as bs

url = 'https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm'

html = urllib2.urlopen(url).read().decode('utf8')
soup = bs.BeautifulSoup(html, 'lxml')

text = soup.get_text()
text = text.encode('utf-8')  # only Python 2
text = text.lower()

# replace non-breaking spaces with normal spaces
#text = text.replace(chr(160), " ")            # Python 3
text = text.replace(chr(194) + chr(160), " ")  # Python 2

search = 'item 7(a)'

# find every occurrence in text
pos = 0
while True:
    pos = text.find(search, pos)
    if pos == -1:
        break
    #print(pos, ">"+text[pos-1]+"<", ord(text[pos-1]))
    print(text[pos:pos+20])
    pos += 1  # continue searching after the current match
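The search loop can also be wrapped in a small generator so that the positions are reusable. This is only a sketch: `find_all` is a hypothetical helper name, and it is run here on the hand-made `text1` string from the question instead of the downloaded page.

```python
def find_all(text, needle):
    """Yield the start index of every occurrence of needle in text."""
    pos = text.find(needle)
    while pos != -1:
        yield pos
        # Resume one character later so overlapping matches are not skipped.
        pos = text.find(needle, pos + 1)

text1 = "hf dfbd item 7. abcd sfjsdf sdbfjkds item 7(a). adfbdf item 8. skjfbdk item 7. sdfkba ootgf sffdfd item 7(a). sfbdskf sfdf item 8. sdfbksdf "

positions = list(find_all(text1, "item 7(a)"))
for pos in positions:
    print(pos, text1[pos:pos + 20])
```

The first position printed matches the 37 shown in the question's output.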

EDIT: tested only with Python 3

You can also search for the string 'item&nbsp;7(a)' after unescaping it.
But you have to know that this place uses &nbsp; instead of " ".

from html import unescape 

search = unescape('item&nbsp;7(a)')
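If it is not known in advance which kind of space the page uses, a regular expression is an alternative to the approach in this answer. The sketch below assumes Python 3, where `\s` in a `str` pattern also matches the non-breaking space; `sample` is a made-up string, not text from the filing.

```python
import re
from html import unescape

# unescape() turns the &nbsp; entity into the single character "\xa0".
nbsp = unescape("&nbsp;")

# Made-up sample that mixes a non-breaking space and a normal space.
sample = unescape("Item&nbsp;7(a). and later item 7(a) again").lower()

# \s+ accepts either kind of space; the parentheses are escaped
# because they are special characters in a pattern.
pattern = re.compile(r"item\s+7\(a\)")
positions = [m.start() for m in pattern.finditer(sample)]
print(positions)
```

This finds both spellings in one pass, at the cost of being less explicit than replacing the character up front.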

Full code

import urllib.request as urllib2  # Python 3
#import urllib2                   # Python 2
import re
import bs4 as bs
from html import unescape         # Python 3 only

url = 'https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm'

html = urllib2.urlopen(url).read().decode('utf8')
soup = bs.BeautifulSoup(html, 'lxml')

text = soup.get_text()
text = text.lower()

# search string with a non-breaking space instead of a normal space
search = unescape('item&nbsp;7(a)')

# find every occurrence in text
pos = 0
while True:
    pos = text.find(search, pos)
    if pos == -1:
        break
    #print(pos, ">"+text[pos-1]+"<", ord(text[pos-1]))
    print(text[pos:pos+20])
    pos += 1  # continue searching after the current match

Source: utcz.com/qa/258981.html
