How to extract text from Word files (.doc, .docx, .xlsx, .pptx) in PHP

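In some cases we may need to get the text out of Word documents so that we can later search for strings in documents uploaded by users, for example when searching CVs/resumes. A common question then comes up: how do we get the text by opening and reading a Word document that a user has uploaded? There are some useful links, but none of them solves the whole problem. We need to extract the text at upload time and save it in the database so that it can be searched easily there.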

Answer:

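class DocxConversion {

    private $filename;

    public function __construct($filePath) {
        $this->filename = $filePath;
    }

    // Legacy .doc: scan the binary file for runs of plain text.
    private function read_doc() {
        // Read the whole file as bytes; returns false if it cannot be read.
        $line = @file_get_contents($this->filename);
        if ($line === false) {
            return false;
        }

        // Split on carriage returns and keep only chunks without NUL bytes,
        // which is a rough way to skip the binary parts of the file.
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            if (strpos($thisline, chr(0x00)) === false && strlen($thisline) > 0) {
                $outtext .= $thisline . " ";
            }
        }

        // Keep only basic ASCII characters; everything else (including non-English text) is dropped.
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }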

    // .docx: a zip archive whose main body is stored in word/document.xml.
    private function read_docx() {
        // ZipArchive replaces the zip_open() family, which is deprecated as of PHP 8.0.
        $zip = new ZipArchive;
        if ($zip->open($this->filename) !== true) {
            return false;
        }

        // getFromName() returns false when the entry is missing.
        $content = $zip->getFromName("word/document.xml");
        $zip->close();
        if ($content === false) {
            return "";
        }

        // Separate table cells with a space and paragraphs with a line break
        // before stripping the remaining XML tags.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);
        return $striped_content;
    }

    /************************excel sheet************************************/

    function xlsx_to_text($input_file) {
        $xml_filename = "xl/sharedStrings.xml"; // shared string table of the workbook
        $zip_handle = new ZipArchive;
        $output_text = "";

        if (true === $zip_handle->open($input_file)) {
            if (($xml_index = $zip_handle->locateName($xml_filename)) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() must be called on an instance; the static call is an error on PHP 8.
                // LIBXML_NOENT is left out so entities in uploaded files are not expanded.
                $xml_handle = new DOMDocument();
                if ($xml_handle->loadXML($xml_datas, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
                    $output_text = strip_tags($xml_handle->saveXML());
                }
            }
            $zip_handle->close();
        }

        return $output_text;
    }

    /*************************power point files*****************************/

    function pptx_to_text($input_file) {
        $zip_handle = new ZipArchive;
        $output_text = "";

        if (true === $zip_handle->open($input_file)) {
            // Slides are stored as ppt/slides/slide1.xml, slide2.xml, and so on;
            // the loop stops at the first slide number that is missing.
            $slide_number = 1;
            while (($xml_index = $zip_handle->locateName("ppt/slides/slide" . $slide_number . ".xml")) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() must be called on an instance; the static call is an error on PHP 8.
                // LIBXML_NOENT is left out so entities in uploaded files are not expanded.
                $xml_handle = new DOMDocument();
                if ($xml_handle->loadXML($xml_datas, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
                    $output_text .= strip_tags($xml_handle->saveXML());
                }
                $slide_number++;
            }
            $zip_handle->close();
        }

        return $output_text;
    }

    public function convertToText() {
        if (!isset($this->filename) || !file_exists($this->filename)) {
            return "File Not exists";
        }

        // Lower-case the extension so "CV.DOCX" is handled like "cv.docx";
        // files without an extension fall through to "Invalid File Type".
        $fileArray = pathinfo($this->filename);
        $file_ext = isset($fileArray['extension']) ? strtolower($fileArray['extension']) : "";

        if ($file_ext == "doc") {
            return $this->read_doc();
        } elseif ($file_ext == "docx") {
            return $this->read_docx();
        } elseif ($file_ext == "xlsx") {
            // xlsx_to_text() and pptx_to_text() take the file path as an argument.
            return $this->xlsx_to_text($this->filename);
        } elseif ($file_ext == "pptx") {
            return $this->pptx_to_text($this->filename);
        } else {
            return "Invalid File Type";
        }
    }

}
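One limitation of xlsx_to_text() above is that strip_tags() joins neighboring shared strings with nothing between them, so "Name" and "Email" come out as "NameEmail". The following is a minimal sketch of one way around that, reading each `<si>` entry separately with the DOM extension. The helper name sharedStringsToText() and the sample XML are made up for illustration, and the namespace URI assumes a workbook saved in the usual (Transitional) format; Strict-format files use a different namespace.

```php
<?php
// Hypothetical helper (not part of the class above): returns one shared
// string per line instead of one undelimited run of text.
function sharedStringsToText(string $xml): string
{
    // Assumption: Transitional SpreadsheetML namespace.
    $ns = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';

    $dom = new DOMDocument();
    // LIBXML_NONET blocks network access while parsing untrusted uploads.
    if (!$dom->loadXML($xml, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }

    $strings = [];
    // Each <si> element is one shared string; it may contain several <t> nodes.
    foreach ($dom->getElementsByTagNameNS($ns, 'si') as $item) {
        $text = '';
        foreach ($item->getElementsByTagNameNS($ns, 't') as $node) {
            $text .= $node->textContent;
        }
        $strings[] = $text;
    }

    return implode("\n", $strings);
}

// Made-up sample standing in for the contents of xl/sharedStrings.xml.
$sample = '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
        . '<si><t>Name</t></si><si><t>Email</t></si></sst>';

$result = sharedStringsToText($sample);
echo $result, "\n";
```

Note that the shared string table only holds text cells; numeric cell values live in the individual sheet files and are not covered by this sketch or by xlsx_to_text().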

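.doc files are binary blobs and can be read with fopen. .docx files, on the other hand, are just zip containers holding XML files (source: Wikipedia), so you can read them with zip_open or, since the zip_open functions are deprecated as of PHP 8.0, with ZipArchive.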
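To make the "zip container of XML files" point concrete, here is a small sketch that builds a stand-in .docx with a single word/document.xml entry and then reads the paragraph text back with ZipArchive and DOMDocument. The function name docxToText(), the temporary file name, and the sample XML are assumptions for illustration; a real Word file contains many more parts, and the namespace URI assumes the usual (Transitional) format.

```php
<?php
// Hypothetical helper: returns the text of each paragraph on its own line.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    // The main document body is stored in this archive entry.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return '';
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xml, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }

    // Assumption: Transitional WordprocessingML namespace.
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $paragraphs = [];
    foreach ($dom->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $text = '';
        // <w:t> elements carry the visible text of each run.
        foreach ($paragraph->getElementsByTagNameNS($ns, 't') as $node) {
            $text .= $node->textContent;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

// Build a tiny stand-in archive so the sketch runs without a real document.
$path = sys_get_temp_dir() . '/sample_' . uniqid() . '.docx';
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
     . '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
     . '<w:p><w:r><w:t>World</w:t></w:r></w:p></w:body></w:document>';
$zip = new ZipArchive();
$zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$zip->addFromString('word/document.xml', $xml);
$zip->close();

$text = docxToText($path);
echo $text, "\n";
unlink($path);
```

Walking the `<w:p>` and `<w:t>` elements keeps paragraph boundaries without relying on exact tag sequences, which is the main difference from the str_replace/strip_tags approach in the answer.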

Usage of the class above:

$docObj = new DocxConversion("test.doc");
//$docObj = new DocxConversion("test.docx");
//$docObj = new DocxConversion("test.xlsx");
//$docObj = new DocxConversion("test.pptx");

echo $docText = $docObj->convertToText();
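The question asks for the text to be saved in a database at upload time so it can be searched later. A rough sketch of that step follows, using PDO with an in-memory SQLite database so that it runs on its own. The table name, column names, file name, and sample text are all assumptions for illustration; in a real upload handler $docText would be the return value of convertToText(), checked first for false and for the "File Not exists" and "Invalid File Type" messages.

```php
<?php
// Hypothetical schema: "documents", "filename", and "body" are made-up names.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT, body TEXT)');

// Placeholder for the text extracted from an uploaded file.
$docText = 'Senior PHP developer with MySQL experience';

// Store the extracted text with a prepared statement.
$insert = $pdo->prepare('INSERT INTO documents (filename, body) VALUES (?, ?)');
$insert->execute(['resume.docx', $docText]);

// Search the stored text for a keyword.
$search = $pdo->prepare('SELECT filename FROM documents WHERE body LIKE ?');
$search->execute(['%MySQL%']);
$matches = $search->fetchAll(PDO::FETCH_COLUMN);

print_r($matches);
```

A LIKE query is the simplest option; for larger collections a full-text index in the database of your choice is likely a better fit.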

