How to parse a large (50 GB) XML file in Java

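Currently I'm trying to use a SAX parser, but about 3/4 of the way through the file it completely freezes up. I have tried allocating more memory and so on, but nothing has improved.

Is there any way to speed this up? Is there a better approach?

I stripped it down to the bare bones, so I now have the following code. When run from the command line it still isn't as fast as I would like.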

Running it with "java -Xms4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error around article 700000.

Main:

import java.util.ArrayList;

public class Read {

    public static void main(String[] args) {
        // Holds every parsed page in memory at once
        ArrayList<Page> pages = XMLManager.getPages();
    }
}

XMLManager:

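import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {

    public static ArrayList<Page> getPages() {
        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("..\\enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();
            parser.parse(file, pageHandler);
            // Only available once the whole file has been parsed
            pages = pageHandler.getPages();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return pages;
    }
}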

PageHandler:

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {

    // Every non-redirect page is kept here until parsing finishes
    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler() {
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // Start a fresh text buffer for each element
        stringBuilder = new StringBuilder();
        if (qName.equals("page")) {
            page = new Page();
            idSet = false;
        } else if (qName.equals("redirect")) {
            if (page != null) {
                page.setRedirecting(true);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (page != null && !page.isRedirecting()) {
            if (qName.equals("title")) {
                page.setTitle(stringBuilder.toString());
            } else if (qName.equals("id")) {
                // Only the first <id> inside a page is the page id
                if (!idSet) {
                    page.setId(Integer.parseInt(stringBuilder.toString()));
                    idSet = true;
                }
            } else if (qName.equals("text")) {
                String articleText = stringBuilder.toString();
                articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
                articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
                articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
                articleText = articleText.replaceAll("\\|", " "); //Separate multiple links
                articleText = articleText.replaceAll("\\n", " "); //remove new lines
                articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
                articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space
                Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                Matcher matcher = pattern.matcher(articleText);
                if (matcher.find()) {
                    page.setSummaryText(matcher.group());
                } else {
                    page.setSummaryText("None");
                }
                page.setText(articleText);
            } else if (qName.equals("page")) {
                pages.add(page);
                page = null;
            }
        } else {
            // No open page, or the page is a redirect: discard it
            page = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        stringBuilder.append(ch, start, length);
    }

    public ArrayList<Page> getPages() {
        return pages;
    }
}

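Answer:

Your parsing code is probably working fine, but the volume of data you are loading is most likely too large to hold in memory in that ArrayList.

You need some sort of pipeline that passes the data on to its actual destination without ever storing all of it in memory at once.

What I've sometimes done in this sort of situation is similar to the following.

Create an interface for processing a single element: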

public interface PageProcessor {
    void process(Page page);
}

Provide an implementation of this to the PageHandler through its constructor:

public class Read {

    public static void main(String[] args) {
        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing,
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }
}

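import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {

    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            // The handler forwards each page to the processor as soon as it is complete
            PageHandler pageHandler = new PageHandler(processor);
            parser.parse(file, pageHandler);
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}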

Send the data to this processor instead of putting it in the list:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {

    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        //Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        //Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // Elide code not needing change
            } else if (qName.equals("page")) {
                // Hand the finished page to the processor instead of adding it to a list
                processor.process(page);
                page = null;
            }
        } else {
            page = null;
        }
    }
}
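If you want to try the callback pattern end to end before pointing it at the real dump, here is a minimal self-contained sketch. It is not your code: Page is a hypothetical two-field stand-in (your real class isn't shown), the text clean-up and redirect handling are left out, and the StreamingDemo wrapper and its run helper exist only so that it can be run against a small in-memory string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StreamingDemo {
    /** Hypothetical minimal Page; the real class is not shown in the question. */
    static class Page {
        String title;
        int id;
    }

    /** Same single-method shape as the PageProcessor interface above. */
    interface PageProcessor {
        void process(Page page);
    }

    /** Forwards each finished page to the processor and keeps no list of pages. */
    static class PageHandler extends DefaultHandler {
        private final PageProcessor processor;
        private final StringBuilder text = new StringBuilder();
        private Page page;
        private boolean idSet;

        PageHandler(PageProcessor processor) {
            this.processor = processor;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0); // reuse one buffer per element
            if (qName.equals("page")) {
                page = new Page();
                idSet = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (page == null) {
                return;
            }
            if (qName.equals("title")) {
                page.title = text.toString();
            } else if (qName.equals("id") && !idSet) {
                page.id = Integer.parseInt(text.toString().trim()); // first <id> only
                idSet = true;
            } else if (qName.equals("page")) {
                processor.process(page); // hand off, then drop the reference
                page = null;
            }
        }
    }

    /** Demo helper (an assumption of this sketch): parses a small XML string. */
    static List<String> run(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new PageHandler(p -> seen.add(p.id + ":" + p.title)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki>"
                + "<page><title>A</title><id>1</id><revision><id>99</id></revision></page>"
                + "<page><title>B</title><id>2</id></page>"
                + "</mediawiki>";
        System.out.println(run(xml)); // prints [1:A, 2:B]
    }
}
```

The demo collects short strings only so that there is something to print. With the real dump the processor should write each page to its destination and let it go.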

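Of course, you could make your interface handle chunks of multiple records rather than just one, and have the PageHandler collect pages locally in a smaller list, periodically sending the list off for processing and then clearing it.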
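A rough sketch of that batch variant might look like this. Everything in it is an assumption made for illustration: the PageBatchProcessor interface, the BatchingHandler name and the batchSize parameter are made up, and a page is reduced to its title string to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BatchDemo {
    /** Hypothetical batch-oriented counterpart of PageProcessor. */
    interface PageBatchProcessor {
        void process(List<String> titles);
    }

    /** Collects titles in a small list and sends the list off whenever it fills up. */
    static class BatchingHandler extends DefaultHandler {
        private final PageBatchProcessor processor;
        private final int batchSize;
        private final List<String> batch = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        BatchingHandler(PageBatchProcessor processor, int batchSize) {
            this.processor = processor;
            this.batchSize = batchSize;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                batch.add(text.toString());
                if (batch.size() >= batchSize) {
                    flush();
                }
            }
        }

        @Override
        public void endDocument() {
            flush(); // do not lose the final, partially filled batch
        }

        private void flush() {
            if (batch.isEmpty()) {
                return;
            }
            processor.process(new ArrayList<>(batch)); // pass a copy so clearing is safe
            batch.clear();
        }
    }

    /** Demo helper (an assumption of this sketch): returns the batches that were sent. */
    static List<List<String>> run(String xml, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new BatchingHandler(batches::add, batchSize));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return batches;
    }

    public static void main(String[] args) {
        String xml = "<w><page><title>A</title></page><page><title>B</title></page>"
                + "<page><title>C</title></page></w>";
        System.out.println(run(xml, 2)); // prints [[A, B], [C]]
    }
}
```

The one detail that is easy to miss is the flush in endDocument; without it the last partial batch never reaches the processor.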

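Or (perhaps better) you could implement the PageProcessor interface as defined here and build logic into that implementation to buffer the data and send it on for further handling in chunks.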
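Something along these lines, as a sketch only: BufferingPageProcessor, the chunkSize parameter and the Consumer sink are names invented for this example, and Page is again a hypothetical stand-in. The handler stays exactly as above and only ever sees a PageProcessor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ChunkDemo {
    /** Hypothetical minimal Page; the real class is not shown in the question. */
    static class Page {
        final String title;

        Page(String title) {
            this.title = title;
        }
    }

    /** Same single-method shape as the PageProcessor interface above. */
    interface PageProcessor {
        void process(Page page);
    }

    /** Buffers pages and passes them to the sink in chunks of at most chunkSize. */
    static class BufferingPageProcessor implements PageProcessor {
        private final int chunkSize;
        private final Consumer<List<Page>> sink;
        private final List<Page> buffer = new ArrayList<>();

        BufferingPageProcessor(int chunkSize, Consumer<List<Page>> sink) {
            this.chunkSize = chunkSize;
            this.sink = sink;
        }

        @Override
        public void process(Page page) {
            buffer.add(page);
            if (buffer.size() >= chunkSize) {
                flush();
            }
        }

        /** Call once after parsing has finished to send the remaining pages. */
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            sink.accept(new ArrayList<>(buffer)); // pass a copy so clearing is safe
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> chunkSizes = new ArrayList<>();
        BufferingPageProcessor processor =
                new BufferingPageProcessor(2, chunk -> chunkSizes.add(chunk.size()));
        for (int i = 0; i < 5; i++) {
            processor.process(new Page("page " + i));
        }
        processor.flush();
        System.out.println(chunkSizes); // prints [2, 2, 1]
    }
}
```

In real use the sink would be whatever writes a chunk to its destination (a database batch insert, a file, a queue), and flush would be called right after parser.parse returns.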

