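Jsoup: getting a div element by its class
I am new to Jsoup parsing and I want to get the list of all the companies listed on this page. One way is to inspect the page for the div tags related to what I need. However, when I call: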
Document doc = Jsoup.connect("https://angel.co/companies?company_types[]=Startup").get();
System.out.println(doc.html());
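First, I cannot even find those div tags (the ones that should hold the list of companies) in the HTML printed to my console. Second, even if I did find them, how can I select a div element with a given class name: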
div class=" dc59 frw44 _a _jm"
Pardon the jargon; I do not know how to go about this.
Answer:
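The data is not embedded in the page; it is retrieved through subsequent API calls:
- a POST to https://angel.co/company_filters/search_data, which returns an ids array and a token named hexdigest
- a GET to https://angel.co/companies/startups, which retrieves the company data using the output of the previous request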
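These two calls are repeated for every page, so each page needs a new token and a new list of ids. You can watch this happen in the Network tab of the Chrome developer console.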
The first request (POST) returns plain JSON, but the second request (GET) returns the HTML data inside a property of a JSON object.
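To make the shape of the first call concrete, here is a minimal sketch that builds the same POST with only the JDK (`java.net.http`, Java 11+). It constructs the request but does not send it. The class name `FilterRequestSketch` and its method names are hypothetical, and the form field name `filter_data[company_types][]` is an assumption about what the endpoint expects.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of the original answer. */
class FilterRequestSketch {

    /** Form-encodes the three fields sent with the first request. */
    public static String formBody(String companyType, int page) {
        // Assumed field name: filter_data[company_types][]
        return enc("filter_data[company_types][]") + "=" + enc(companyType)
                + "&" + enc("sort") + "=" + enc("signal")
                + "&" + enc("page") + "=" + enc(String.valueOf(page));
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /** Builds (but does not send) the POST request. */
    public static HttpRequest buildRequest(String companyType, int page) {
        return HttpRequest.newBuilder(URI.create("https://angel.co/company_filters/search_data"))
                .header("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
                .header("X-Requested-With", "XMLHttpRequest")
                .header("User-Agent", "Mozilla")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(companyType, page)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("Startup", 1);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formBody("Startup", 1));
    }
}
```

Sending the request would take an `HttpClient` and a JSON parser for the response; both are left out so the sketch stays free of network access and third-party libraries.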
The following method retrieves the company filter:
// Imports needed by this method:
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

private static CompanyFilter getCompanyFilter(final String filter, final int page) throws IOException {
    // execute().body() returns the raw JSON response instead of parsing it as HTML
    String response = Jsoup.connect("https://angel.co/company_filters/search_data")
            .header("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
            .header("X-Requested-With", "XMLHttpRequest")
            .data("filter_data[company_types][]", filter)
            .data("sort", "signal")
            .data("page", String.valueOf(page))
            .userAgent("Mozilla")
            .ignoreContentType(true)
            .method(Connection.Method.POST)
            .execute()
            .body();
    GsonBuilder gsonBuilder = new GsonBuilder();
    Gson gson = gsonBuilder.create();
    return gson.fromJson(response, CompanyFilter.class);
}
Then the following method extracts the companies:
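// Imports needed by this method:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.google.gson.Gson;

private static List<Company> getCompanies(final CompanyFilter companyFilter) throws IOException {
    List<Company> companies = new ArrayList<>();
    URLConnection urlConn = new URL("https://angel.co/companies/startups?" + companyFilter.buildRequest()).openConnection();
    urlConn.setRequestProperty("User-Agent", "Mozilla");
    urlConn.connect();
    // The JSON response carries the company list as an HTML string in its "html" property
    HtmlContainer htmlObj;
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(urlConn.getInputStream(), "UTF-8"))) {
        htmlObj = new Gson().fromJson(reader, HtmlContainer.class);
    }
    Element doc = Jsoup.parse(htmlObj.getHtml());
    Elements data = doc.select("div[data-_tn]");
    if (data.size() > 0) {
        for (int i = 2; i < data.size(); i++) {
            companies.add(new Company(data.get(i).select("a").first().attr("title"),
                    data.get(i).select("a").first().attr("href"),
                    data.get(i).select("div.pitch").first().text()));
        }
    } else {
        System.out.println("no data");
    }
    return companies;
}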
The main function:
public static void main(String[] args) throws IOException {
    int pageCount = 1;
    List<Company> companies = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
        System.out.println("get page n°" + pageCount);
        CompanyFilter companyFilter = getCompanyFilter("Startup", pageCount);
        pageCount++;
        System.out.println("digest : " + companyFilter.getDigest());
        System.out.println("count : " + companyFilter.getTotalCount());
        System.out.println("array size : " + companyFilter.getIds().size());
        System.out.println("page : " + companyFilter.getpage());
        // Stop at the first page that yields no companies
        List<Company> pageCompanies = getCompanies(companyFilter);
        if (pageCompanies.isEmpty()) {
            break;
        }
        companies.addAll(pageCompanies);
        System.out.println("size : " + companies.size());
    }
}
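The paging loop can also be separated from the network calls, which makes it possible to exercise without contacting the site. The sketch below is one way to do that, under stated assumptions: `PageCollectorSketch`, `collect`, and `fakePage` are hypothetical names, and `fakePage` is a stand-in data source rather than anything the site returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical helper, not part of the original answer. */
class PageCollectorSketch {

    /** Calls fetchPage for pages 1..maxPages and stops at the first empty page. */
    public static <T> List<T> collect(IntFunction<List<T>> fetchPage, int maxPages) {
        List<T> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<T> items = fetchPage.apply(page);
            if (items.isEmpty()) {
                break;
            }
            all.addAll(items);
        }
        return all;
    }

    /** Stand-in data source: two pages with one item each, then nothing. */
    public static List<String> fakePage(int page) {
        return page <= 2 ? List.of("company-" + page) : List.of();
    }

    public static void main(String[] args) {
        // prints [company-1, company-2]
        System.out.println(collect(PageCollectorSketch::fakePage, 10));
    }
}
```

With the real methods, the page function would call getCompanyFilter and getCompanies for the given page number; since both throw IOException, it would need to wrap that exception (for example in UncheckedIOException), because IntFunction cannot throw checked exceptions.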
Company, CompanyFilter and HtmlContainer are the model classes:
// Import needed by the model classes:
import com.google.gson.annotations.SerializedName;

class CompanyFilter {
    @SerializedName("ids")
    private List<Integer> mIds;
    @SerializedName("hexdigest")
    private String mDigest;
    @SerializedName("total")
    private String mTotalCount;
    @SerializedName("page")
    private int mPage;
    @SerializedName("sort")
    private String mSort;
    @SerializedName("new")
    private boolean mNew;

    public List<Integer> getIds() {
        return mIds;
    }

    public String getDigest() {
        return mDigest;
    }

    public String getTotalCount() {
        return mTotalCount;
    }

    public int getpage() {
        return mPage;
    }

    // Builds the query string for the second (GET) request from the fields of the first response
    private String buildRequest() {
        String out = "total=" + mTotalCount + "&";
        out += "sort=" + mSort + "&";
        out += "page=" + mPage + "&";
        out += "new=" + mNew + "&";
        for (int i = 0; i < mIds.size(); i++) {
            out += "ids[]=" + mIds.get(i) + "&";
        }
        out += "hexdigest=" + mDigest + "&";
        return out;
    }
}
private static class Company {
    private String mLink;
    private String mName;
    private String mDescription;

    public Company(String name, String link, String description) {
        mLink = link;
        mName = name;
        mDescription = description;
    }

    public String getLink() {
        return mLink;
    }

    public String getName() {
        return mName;
    }

    public String getDescription() {
        return mDescription;
    }
}
private static class HtmlContainer {
    @SerializedName("html")
    private String mHtml;

    public String getHtml() {
        return mHtml;
    }
}
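The buildRequest method in CompanyFilter concatenates the parameters without encoding them. As a hedged alternative, the sketch below builds the same parameter sequence with `URLEncoder` and `StringJoiner` from the JDK. The class name `QueryStringSketch` and the sample values in `main` are made up for illustration, and it assumes the server accepts percent-encoded brackets in `ids[]`, which is the usual behavior for form decoding but has not been checked against this site.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper, not part of the original answer. */
class QueryStringSketch {

    /** Joins the parameters with "&" in the same order as buildRequest, encoding names and values. */
    public static String build(String total, String sort, int page, boolean isNew,
                               List<Integer> ids, String hexdigest) {
        StringJoiner query = new StringJoiner("&");
        query.add(pair("total", total));
        query.add(pair("sort", sort));
        query.add(pair("page", String.valueOf(page)));
        query.add(pair("new", String.valueOf(isNew)));
        for (Integer id : ids) {
            query.add(pair("ids[]", String.valueOf(id)));
        }
        query.add(pair("hexdigest", hexdigest));
        return query.toString();
    }

    private static String pair(String name, String value) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8) + "="
                + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample values are invented for the demonstration
        System.out.println(build("2", "signal", 1, false, List.of(10, 20), "abc123"));
    }
}
```

Unlike the concatenation approach, `StringJoiner` leaves no trailing `&`, and encoding guards against values that contain reserved characters.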
The complete code is also available here
Source: utcz.com/qa/258722.html