Jsoup: get a div element by its class

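I'm new to parsing with Jsoup, and I want to get the list of all the companies on this page. One way to do this is to inspect the page and find the div tags that hold what I need. However, when I call: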

Document doc = Jsoup.connect("https://angel.co/companies?company_types[]=Startup").get(); 

System.out.println(doc.html());

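First, I can't even find those div tags (the ones that hold the list of companies) in the HTML printed to my console. Second, even if I did find them, how can I select a div element with a certain class name: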

div class=" dc59 frw44 _a _jm" 

Pardon the jargon, I don't know how to go about this.

Answer:

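The data is not embedded in the page; it is retrieved with subsequent API calls:

  • a POST to https://angel.co/company_filters/search_data, which returns an ids array and a token named hexdigest
  • a GET to https://angel.co/companies/startups, which retrieves the company data using the output of the previous request

This is repeated for every page, so each page needs a new token and a new list of ids. You can watch the process in the Network tab of the Chrome developer console.

The first (POST) request returns plain JSON, while the second (GET) request returns HTML data inside a property of a JSON object.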
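As a rough illustration of what the two requests carry, here is a minimal sketch that builds the POST form body and the GET query string using only the standard library. The parameter names are taken from the code in this answer; the class name `RequestSketch`, its method names, and the sample values (`50`, `101`, `202`, `abc123`) are made up for illustration. It percent-encodes the brackets, on the assumption that the server accepts the encoded form as well as the raw one.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class RequestSketch {

    // Percent-encodes one key or value (this overload needs Java 10 or later).
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Form body for the first request (POST to /company_filters/search_data).
    static String searchFormBody(String companyType, int page) {
        return enc("filter_data[company_types][]") + "=" + enc(companyType)
                + "&sort=signal"
                + "&page=" + page;
    }

    // Query string for the second request (GET to /companies/startups),
    // built from the values returned by the first request.
    static String startupsQuery(String total, String sort, int page, boolean isNew,
                                List<Integer> ids, String hexdigest) {
        StringBuilder out = new StringBuilder();
        out.append("total=").append(enc(total));
        out.append("&sort=").append(enc(sort));
        out.append("&page=").append(page);
        out.append("&new=").append(isNew);
        for (int id : ids) {
            out.append('&').append(enc("ids[]")).append('=').append(id);
        }
        out.append("&hexdigest=").append(enc(hexdigest));
        return out.toString();
    }

    public static void main(String[] args) {
        // Placeholder values only; real ids and tokens come from the POST response.
        System.out.println(searchFormBody("Startup", 1));
        System.out.println(startupsQuery("50", "signal", 1, false, List.of(101, 202), "abc123"));
    }
}
```

Because the token and the ids change on every page, both strings have to be rebuilt for each page rather than cached.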

The following extracts the company filter:

private static CompanyFilter getCompanyFilter(final String filter, final int page) throws IOException {
    // First request: POST the filter and the page number; the response is JSON.
    String response = Jsoup.connect("https://angel.co/company_filters/search_data")
            .header("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
            .header("X-Requested-With", "XMLHttpRequest")
            // The key must not end with "="; jsoup adds the separator itself.
            .data("filter_data[company_types][]", filter)
            .data("sort", "signal")
            .data("page", String.valueOf(page))
            .userAgent("Mozilla")
            .ignoreContentType(true)
            .method(org.jsoup.Connection.Method.POST)
            .execute()
            .body(); // raw response body, without running it through the HTML parser

    // Map the JSON fields (ids, hexdigest, total, ...) onto CompanyFilter.
    return new Gson().fromJson(response, CompanyFilter.class);
}

Then the following extracts the companies:

private static List<Company> getCompanies(final CompanyFilter companyFilter) throws IOException {
    List<Company> companies = new ArrayList<>();

    // Second request: GET the company rows for the ids and token from the first request.
    URLConnection urlConn = new URL("https://angel.co/companies/startups?" + companyFilter.buildRequest()).openConnection();
    urlConn.setRequestProperty("User-Agent", "Mozilla");
    urlConn.connect();

    // The response is JSON whose "html" property holds the markup to parse.
    HtmlContainer htmlObj;
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(urlConn.getInputStream(), "UTF-8"))) {
        htmlObj = new Gson().fromJson(reader, HtmlContainer.class);
    }

    Element doc = Jsoup.parse(htmlObj.getHtml());
    Elements data = doc.select("div[data-_tn]");

    if (data.size() > 0) {
        // Start at index 2, so the first two matches are skipped.
        for (int i = 2; i < data.size(); i++) {
            Element row = data.get(i);
            Element link = row.select("a").first();
            companies.add(new Company(link.attr("title"),
                    link.attr("href"),
                    row.select("div.pitch").first().text()));
        }
    } else {
        System.out.println("no data");
    }
    return companies;
}
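The `div.pitch` selector above already selects by class name, which is what the second part of the question asks about. As a small sketch (it needs the jsoup library on the classpath), chaining several `.name` parts in one selector matches only divs that carry all of those classes, and `classNames()` reads the classes back from an element. The class `ClassSelectorSketch`, its methods, and the sample HTML are made up for illustration; the class names are the ones quoted in the question, and names of that kind look auto-generated, so they may change between page loads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

final class ClassSelectorSketch {

    // Illustrative markup only; it is not copied from the real page.
    static final String SAMPLE_HTML =
            "<div class=\" dc59 frw44 _a _jm\"><a title=\"Acme\" href=\"/acme\">Acme</a></div>"
            + "<div class=\"dc59 other\">skip</div>";

    // Returns the text of every div that carries ALL of the given class names.
    static List<String> textsOfDivsWithClasses(String html, String... classNames) {
        StringBuilder css = new StringBuilder("div");
        for (String name : classNames) {
            css.append('.').append(name); // e.g. div.dc59.frw44._a._jm
        }
        List<String> texts = new ArrayList<>();
        for (Element div : Jsoup.parse(html).select(css.toString())) {
            texts.add(div.text());
        }
        return texts;
    }

    // Returns the class names of the first div in the document, or an empty set.
    static Set<String> classesOfFirstDiv(String html) {
        Element div = Jsoup.parse(html).select("div").first();
        return div == null ? Collections.<String>emptySet() : div.classNames();
    }

    public static void main(String[] args) {
        System.out.println(textsOfDivsWithClasses(SAMPLE_HTML, "dc59", "frw44", "_a", "_jm"));
        System.out.println(classesOfFirstDiv(SAMPLE_HTML));
    }
}
```

Selecting on a stable attribute, as `div[data-_tn]` does above, is likely to be less brittle than relying on generated class names.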

The main function:

public static void main(String[] args) throws IOException {
    int pageCount = 1;
    List<Company> companies = new ArrayList<>();

    for (int i = 0; i < 10; i++) {
        System.out.println("get page n°" + pageCount);
        CompanyFilter companyFilter = getCompanyFilter("Startup", pageCount);
        pageCount++;

        System.out.println("digest : " + companyFilter.getDigest());
        System.out.println("count : " + companyFilter.getTotalCount());
        System.out.println("array size : " + companyFilter.getIds().size());
        System.out.println("page : " + companyFilter.getpage());

        // Stop as soon as a page yields no companies.
        List<Company> pageCompanies = getCompanies(companyFilter);
        if (pageCompanies.isEmpty()) {
            break;
        }
        companies.addAll(pageCompanies);
        System.out.println("size : " + companies.size());
    }
}

Company, CompanyFilter & HtmlContainer are model classes:

class CompanyFilter {

    @SerializedName("ids")
    private List<Integer> mIds;

    @SerializedName("hexdigest")
    private String mDigest;

    @SerializedName("total")
    private String mTotalCount;

    @SerializedName("page")
    private int mPage;

    @SerializedName("sort")
    private String mSort;

    @SerializedName("new")
    private boolean mNew;

    public List<Integer> getIds() {
        return mIds;
    }

    public String getDigest() {
        return mDigest;
    }

    public String getTotalCount() {
        return mTotalCount;
    }

    public int getpage() {
        return mPage;
    }

    // Builds the query string for the second (GET) request.
    // Not private, so that getCompanies() can call it from outside this class.
    String buildRequest() {
        StringBuilder out = new StringBuilder();
        out.append("total=").append(mTotalCount).append('&');
        out.append("sort=").append(mSort).append('&');
        out.append("page=").append(mPage).append('&');
        out.append("new=").append(mNew).append('&');
        for (Integer id : mIds) {
            out.append("ids[]=").append(id).append('&');
        }
        out.append("hexdigest=").append(mDigest).append('&');
        return out.toString();
    }
}

private static class Company {

    private String mLink;
    private String mName;
    private String mDescription;

    public Company(String name, String link, String description) {
        mLink = link;
        mName = name;
        mDescription = description;
    }

    public String getLink() {
        return mLink;
    }

    public String getName() {
        return mName;
    }

    public String getDescription() {
        return mDescription;
    }
}

private static class HtmlContainer {

    @SerializedName("html")
    private String mHtml;

    public String getHtml() {
        return mHtml;
    }
}

The full code is also available here

Source: utcz.com/qa/258722.html
