From the author: not long ago our site published a lesson on creating MS Word documents in PHP using the PHPWord library. In the comments on that video, a reader asked how to use the library to read existing documents. That question prompted this lesson, in which we will use the same library to read previously created MS Word documents.
In this lesson we continue exploring PHPWord, specifically its tools for reading existing MS Word documents. Note that we will work with the library already installed: this is the second lesson on the topic, so we will not dwell on the basics. I therefore recommend going through the first part, "PHPWord: creating MS Word documents with PHP", before watching this video.
The test script consists of a single file, index.php, which already loads the library through the Composer autoloader.
require 'vendor/autoload.php';
First, create a variable that holds the path to the MS Word document we will work with.
$source = __DIR__ . '/docs/text.docx';
Recall that work with the library normally starts by creating an object of the main PhpWord class, but that applies when a new document is being created. When reading an existing MS Word file, the PhpWord object is created for that document, and to get it the file must first be read.
PHPWord provides a group of reader classes, one for each supported document format. So the first step is to create an object of the appropriate reader class.
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
Then use this object to read the MS Word document.
$phpWord = $objReader->load($source);
Strictly speaking, the task of the lesson is now complete: the document has been read and its data sits in the structure of the newly created $phpWord object. But let's look at how to get the data out of that object.
According to the official documentation, PHPWord places all content of an MS Word document in sections. Each section contains a set of elements: text, tables, images, links and so on. Elements can in turn be complex and contain nested elements, as tables do.
Calling getSections() gives access to the document's sections. It returns an array, so we can walk it with foreach.
foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();
}
Inside the loop, for each section, we call getElements() to get the array of its elements. Since the return value is also an array, the same kind of loop gives access to each item.
foreach ($arrays as $e) {
}
On each iteration, $e holds the object of one element of the section. It may seem that we can read the text of the MS Word document right away, but first we need to check what $e actually contains.
if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {
If the variable holds an object of the class 'PhpOffice\PhpWord\Element\TextRun', we are dealing with a complex text area that contains several simpler elements. So we call getElements() again and walk the result with foreach.
<?php
require 'vendor/autoload.php';

$source = __DIR__ . '/docs/text.docx';
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load($source);

$body = '';
foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();
    foreach ($arrays as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {
            foreach ($e->getElements() as $text) {
                // Read the formatting of the current text element
                $font = $text->getFontStyle();
                $size = $font->getSize() / 10;
                $bold = $font->isBold() ? 'font-weight:700;' : '';
                $color = $font->getColor();
                $fontFamily = $font->getName();
                $body .= '<span style="font-size:' . $size . 'em;font-family:' . $fontFamily . ';' . $bold . 'color:#' . $color . '">';
                $body .= $text->getText() . '</span>';
            }
        }
    }
}

include 'templ.php';
So, for the current document, $text receives a Text element object, the simplest kind of text, and calling getText() returns its content. To get the formatting of the current element, call getFontStyle(), which returns an object whose private properties hold that information. The values of those properties are read through dedicated methods:
getSize() – the font size;
isBold() – returns true if the font is bold;
getColor() – the text color;
getName() – the font name.
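The getter values above can be turned into an inline style in one place rather than by string concatenation inside the loop. The following is a minimal sketch, not part of PHPWord: `fontToCss()` and `renderSpan()` are hypothetical helper names, the size is assumed to be in points (the tutorial's listing divides by 10 and uses em instead), and the color is assumed to be a six-digit hex string as in the tutorial's document.

```php
<?php
declare(strict_types=1);

/**
 * Build an inline CSS declaration list from font properties.
 * Hypothetical helper: the arguments are the values returned by
 * getSize(), isBold(), getColor() and getName().
 */
function fontToCss(?float $sizePt, bool $bold, ?string $color, ?string $family): string
{
    $css = [];
    if ($sizePt !== null) {
        $css[] = 'font-size:' . $sizePt . 'pt';
    }
    if ($family !== null && $family !== '') {
        $css[] = 'font-family:' . $family;
    }
    if ($bold) {
        $css[] = 'font-weight:700';
    }
    // Only six-digit hex colors are emitted; any other value is skipped.
    if ($color !== null && preg_match('/^[0-9A-Fa-f]{6}$/', $color) === 1) {
        $css[] = 'color:#' . $color;
    }
    return implode(';', $css);
}

/**
 * Wrap text in a styled span, escaping both the style and the text
 * so that document content cannot break the surrounding HTML.
 */
function renderSpan(string $text, string $css): string
{
    return '<span style="' . htmlspecialchars($css, ENT_QUOTES) . '">'
        . htmlspecialchars($text, ENT_QUOTES) . '</span>';
}
```

Inside the loop this could be used as `$body .= renderSpan($text->getText(), fontToCss(...));`, with the four getter results passed in. Missing values (for example a null size) are simply left out of the style.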
All of the document's content is written to the $body variable, whose value is then displayed through a template. Empty lines of the document are represented by TextBreak elements, which can be handled like this:
else if (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
    $body .= '<br />';
}
Handling tables takes quite a few more lines of code, because a table is a complex Table element made of rows, which in turn are made of cells. On top of that, each cell can contain nested elements of its own; a cell can even contain another table. The full code, including table handling, is given below.
<?php
require 'vendor/autoload.php';

$source = __DIR__ . '/docs/text.docx';
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load($source);

$body = '';
foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();
    foreach ($arrays as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {
            foreach ($e->getElements() as $text) {
                $font = $text->getFontStyle();
                $size = $font->getSize() / 10;
                $bold = $font->isBold() ? 'font-weight:700;' : '';
                $color = $font->getColor();
                $fontFamily = $font->getName();
                $body .= '<span style="font-size:' . $size . 'em;font-family:' . $fontFamily . ';' . $bold . 'color:#' . $color . '">';
                $body .= $text->getText() . '</span>';
            }
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
            $body .= '<br />';
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Element\Table') {
            $body .= '<table border="2px">';
            $rows = $e->getRows();
            foreach ($rows as $row) {
                $body .= '<tr>';
                $cells = $row->getCells();
                foreach ($cells as $cell) {
                    $body .= '<td style="width:' . $cell->getWidth() . '">';
                    $celements = $cell->getElements();
                    foreach ($celements as $celem) {
                        if (get_class($celem) === 'PhpOffice\PhpWord\Element\Text') {
                            $body .= $celem->getText();
                        } elseif (get_class($celem) === 'PhpOffice\PhpWord\Element\TextRun') {
                            foreach ($celem->getElements() as $text) {
                                $body .= $text->getText();
                            }
                        }
                    }
                    $body .= '</td>';
                }
                $body .= '</tr>';
            }
            $body .= '</table>';
        } elseif (method_exists($e, 'getText')) {
            // Any other element that exposes plain text
            $body .= $e->getText();
        }
    }
    break; // only the first section is processed
}

include 'templ.php';
To get the rows, call getRows(); it returns an array of objects, one per row (Row elements). We walk this array with foreach and, for each row, get its cells with getCells(). That again returns an array, which we walk in the same way. For each cell we then call getElements() to get its elements, and process them by the same principle as described above.
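Because tables, rows, cells and text runs all nest in the same way, the walk can also be written once as a recursive function instead of nested loops. The sketch below is only an illustration under stated assumptions: `collectText()` is a hypothetical helper, it relies on duck typing (checking for getRows(), getCells(), getElements() and getText(), the method names used in this lesson) rather than on PHPWord class names, and it returns plain text without any formatting.

```php
<?php
declare(strict_types=1);

/**
 * Recursively gather plain text from an element tree.
 * Hypothetical helper: works with any object exposing the
 * method names used in this lesson.
 */
function collectText(object $element): string
{
    // Table: one line per row
    if (method_exists($element, 'getRows')) {
        return implode("\n", array_map('collectText', $element->getRows()));
    }
    // Row: cells separated by tabs
    if (method_exists($element, 'getCells')) {
        return implode("\t", array_map('collectText', $element->getCells()));
    }
    // Any container (section, cell, text run): concatenate the children
    if (method_exists($element, 'getElements')) {
        return implode('', array_map('collectText', $element->getElements()));
    }
    // Leaf element with text
    if (method_exists($element, 'getText')) {
        $text = $element->getText();
        return is_string($text) ? $text : '';
    }
    // Elements without text (images and the like) contribute nothing
    return '';
}
```

Containers are checked before getText() so that a nested structure is always descended into; a table nested inside a cell is handled by the same function without extra code.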
All that remains is to output the value of $body in whatever way suits you.
That concludes this lesson. As you can see, PHPWord offers quite powerful tools for working with MS Word documents, although extracting data from its objects takes some effort.
All the best, and happy coding!
PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF.
PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord is aimed to be a high quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers’ Documentation.
If you have any questions, please ask on Stack Overflow.
Read more about PHPWord:
- Features
- Requirements
- Installation
- Getting started
- Contributing
- Developers’ Documentation
Features
With PHPWord, you can create OOXML, ODF, or RTF documents dynamically using your PHP scripts. Below are some of the things that you can do with PHPWord library:
- Set document properties, e.g. title, subject, and creator.
- Create document sections with different settings, e.g. portrait/landscape, page size, and page numbering
- Create header and footer for each section
- Set default font type, font size, and paragraph style
- Use UTF-8 and East Asia fonts/characters
- Define custom font styles (e.g. bold, italic, color) and paragraph styles (e.g. centered, multicolumns, spacing) either as named style or inline in text
- Insert paragraphs, either as a simple text or complex one (a text run) that contains other elements
- Insert titles (headers) and table of contents
- Insert text breaks and page breaks
- Insert and format images, either local, remote, or as page watermarks
- Insert binary OLE Objects such as Excel or Visio
- Insert and format tables with customized properties for each row (e.g. repeat as header row) and cell (e.g. background color, rowspan, colspan)
- Insert list items as bulleted, numbered, or multilevel
- Insert hyperlinks
- Insert footnotes and endnotes
- Insert drawing shapes (arc, curve, line, polyline, rect, oval)
- Insert charts (pie, doughnut, bar, line, area, scatter, radar)
- Insert form fields (textinput, checkbox, and dropdown)
- Create document from templates
- Use XSL 1.0 style sheets to transform headers, main document part, and footers of an OOXML template
- … and many more features in progress
Requirements
PHPWord requires the following:
- PHP 7.1+
- XML Parser extension
- Laminas Escaper component
- Zip extension (optional, used to write OOXML and ODF)
- GD extension (optional, used to add images)
- XMLWriter extension (optional, used to write OOXML and ODF)
- XSL extension (optional, used to apply XSL style sheet to template)
- dompdf library (optional, used to write PDF)
Installation
PHPWord is installed via Composer.
To add a dependency to PHPWord in your project, either
Run the following to use the latest stable version
composer require phpoffice/phpword
or if you want the latest unreleased version
composer require phpoffice/phpword:dev-master
Getting started
The following is a basic usage example of the PHPWord library.
<?php
require_once 'bootstrap.php';

// Creating the new document...
$phpWord = new \PhpOffice\PhpWord\PhpWord();

/* Note: any element you append to a document must reside inside of a Section. */

// Adding an empty Section to the document...
$section = $phpWord->addSection();
// Adding Text element to the Section having font styled by default...
$section->addText(
    '"Learn from yesterday, live for today, hope for tomorrow. '
        . 'The important thing is not to stop questioning." '
        . '(Albert Einstein)'
);

/*
 * Note: it's possible to customize font style of the Text element you add in three ways:
 * - inline;
 * - using named font style (new font style object will be implicitly created);
 * - using explicitly created font style object.
 */

// Adding Text element with font customized inline...
$section->addText(
    '"Great achievement is usually born of great sacrifice, '
        . 'and is never the result of selfishness." '
        . '(Napoleon Hill)',
    array('name' => 'Tahoma', 'size' => 10)
);

// Adding Text element with font customized using named font style...
$fontStyleName = 'oneUserDefinedStyle';
$phpWord->addFontStyle(
    $fontStyleName,
    array('name' => 'Tahoma', 'size' => 10, 'color' => '1B2232', 'bold' => true)
);
$section->addText(
    '"The greatest accomplishment is not in never falling, '
        . 'but in rising again after you fall." '
        . '(Vince Lombardi)',
    $fontStyleName
);

// Adding Text element with font customized using explicitly created font style object...
$fontStyle = new \PhpOffice\PhpWord\Style\Font();
$fontStyle->setBold(true);
$fontStyle->setName('Tahoma');
$fontStyle->setSize(13);
$myTextElement = $section->addText('"Believe you can and you\'re halfway there." (Theodor Roosevelt)');
$myTextElement->setFontStyle($fontStyle);

// Saving the document as OOXML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$objWriter->save('helloWorld.docx');

// Saving the document as ODF file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'ODText');
$objWriter->save('helloWorld.odt');

// Saving the document as HTML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$objWriter->save('helloWorld.html');

/* Note: we skip RTF, because it's not XML-based and requires a different example. */
/* Note: we skip PDF, because "HTML-to-PDF" approach is used to create PDF documents. */
More examples are provided in the samples folder. For an easy access to those samples launch php -S localhost:8000 in the samples directory then browse to http://localhost:8000 to view the samples.
You can also read the Developers’ Documentation for more detail.
Contributing
We welcome everyone to contribute to PHPWord. Below are some of the things that you can do to contribute.
- Read our contributing guide.
- Fork us and request a pull to the master branch.
- Submit bug reports or feature requests to GitHub.
- Follow @PHPWord and @PHPOffice on Twitter.
Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object?
I know that I can:
$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);
but Word will read it as an HTML file not a native .doc file.
asked Oct 9, 2008 at 18:09
UnkwnTech
Reading binary Word documents would involve creating a parser according to the published file format specifications for the DOC format. I don't think this is a really feasible solution.
You could use the Microsoft Office XML formats for reading and writing Word files; this is compatible with the 2003 and 2007 versions of Word. For reading you have to ensure that the Word documents are saved in the correct format (it's called Word 2003 XML-Document in Word 2007). For writing you just have to follow the openly available XML schema. I've never used this format for writing out Office documents from PHP, but I'm using it for reading in an Excel worksheet (naturally saved as XML-Spreadsheet 2003) and displaying its data on a web page. As the files are plainly XML data it's no problem to navigate within and figure out how to extract the data you need.
The other option, a Word 2007 only option (if the OpenXML file formats are not installed in your Word 2003), would be to resort to OpenXML. As databyss pointed out here, the DOCX file format is just a ZIP archive with XML files included. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated I think; it just depends on how much time you'll invest.
Perhaps you can have a look at PHPExcel which is a library able to write to Excel 2007 files and read from Excel 2007 files using the OpenXML standard. You could get an idea of the work involved when trying to read and write OpenXML Word documents.
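Since the binary DOC format and the ZIP-based DOCX format need completely different readers, it can help to check which one a file really is before choosing an approach, rather than trusting the extension. The following is a small sketch, not taken from any answer here: `sniffWordFormat()` is a hypothetical name, and it only inspects the well-known leading signature bytes (OLE compound file for .doc, ZIP local file header for OpenXML, `{\rtf` for RTF). A ZIP signature alone does not prove the file is a Word document; .xlsx and .pptx start the same way.

```php
<?php
declare(strict_types=1);

/**
 * Guess the container format from the first bytes of a file.
 * Hypothetical helper; returns 'ole', 'zip', 'rtf' or 'unknown'.
 */
function sniffWordFormat(string $head): string
{
    // OLE2 compound file: legacy binary .doc (also .xls, .ppt)
    if (strncmp($head, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole';
    }
    // ZIP local file header: .docx (also .xlsx, .pptx and any other ZIP)
    if (strncmp($head, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    // Rich Text Format
    if (strncmp($head, '{\rtf', 5) === 0) {
        return 'rtf';
    }
    return 'unknown';
}

// Usage (the path is illustrative): read only the first 8 bytes of the file.
// $kind = sniffWordFormat((string) file_get_contents('test.docx', false, null, 0, 8));
```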
answered Nov 5, 2008 at 13:04
Stefan Gehrig
This works with versions before Office 2007 and is pure PHP, no COM; I'm still trying to figure out 2007.
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty slices and slices that contain a NUL byte
        $pos = strpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Keep only letters, digits, whitespace and basic punctuation
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}
$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>
UnkwnTech
answered Nov 5, 2008 at 12:35
You can use Antiword, it is a free MS Word reader for Linux and most popular OS.
$document_file = 'c:\file.doc';
$text_from_doc = shell_exec('/usr/local/bin/antiword ' . escapeshellarg($document_file));
answered May 23, 2009 at 0:57
Mantichora
I don’t know about reading native Word documents in PHP, but if you want to write a Word document in PHP, WordprocessingML (aka WordML) might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.
answered Oct 10, 2008 at 0:23
Joe Lencioni
Just updating the code
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
$fileHandle = fopen($userDoc, "r");
$word_text = @fread($fileHandle, filesize($userDoc));
$line = "";
$tam = filesize($userDoc);
$nulos = 0;
$caracteres = 0;
for($i=1536; $i<$tam; $i++)
{
$line .= $word_text[$i];
if( $word_text[$i] === chr(0x00) )
{
$nulos++;
}
else
{
$nulos=0;
$caracteres++;
}
if( $nulos>1996)
{
break;
}
}
//echo $caracteres;
$lines = explode(chr(0x0D),$line);
//$outtext = "<pre>";
$outtext = "";
foreach($lines as $thisline)
{
$tam = strlen($thisline);
if( !$tam )
{
continue;
}
$new_line = "";
for($i=0; $i<$tam; $i++)
{
$onechar = $thisline[$i];
if( $onechar > chr(240) )
{
continue;
}
if( $onechar >= chr(0x20) )
{
$caracteres++;
$new_line .= $onechar;
}
if( $onechar == chr(0x14) )
{
$new_line .= "</a>";
}
if( $onechar == chr(0x07) )
{
$new_line .= "\t";
if( isset($thisline[$i+1]) )
{
if( $thisline[$i+1] == chr(0x07) )
{
$new_line .= "\n";
}
}
}
}
//troca por hiperlink
$new_line = str_replace("HYPERLINK" ,"<a href=",$new_line);
$new_line = str_replace("\\o" ,">",$new_line);
$new_line .= "\n";
//link de imagens
$new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line);
$new_line = str_replace("\\*" ,"><br>",$new_line);
$new_line = str_replace("MERGEFORMATINET" ,"",$new_line);
$outtext .= nl2br($new_line);
}
return $outtext;
}
$userDoc = "custo.doc";
$userDoc = "Cultura.doc";
$text = parseWord($userDoc);
echo $text;
?>
answered Apr 4, 2011 at 2:43
WIlson
Most probably you won’t be able to read Word documents without COM.
Writing was covered in this topic
answered Oct 10, 2008 at 2:17
Sergey Kornilov
2007 might be a bit complicated as well.
The .docx format is a zip file that contains a few folders with other files in them for formatting and other stuff.
Rename a .docx file to .zip and you’ll see what I mean.
So if you can work within zip files in PHP, you should be on the right path.
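As a rough illustration of that path, the sketch below opens the archive with PHP's ZipArchive, pulls out word/document.xml, and collects the text of each paragraph with DOMDocument. The function names `readDocxText()` and `documentXmlToText()` are made up for this example, and it assumes the zip and dom extensions are available. It returns plain text only: formatting, tables, headers, footers and footnotes are ignored.

```php
<?php
declare(strict_types=1);

// Namespace of WordprocessingML elements such as w:p and w:t
const WORDML_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/**
 * Extract text from the XML of word/document.xml, one line per paragraph.
 * Hypothetical helper; returns an empty string if the XML cannot be parsed.
 */
function documentXmlToText(string $xml): string
{
    if ($xml === '') {
        return '';
    }
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $paragraphs = [];
    foreach ($dom->getElementsByTagNameNS(WORDML_NS, 'p') as $p) {
        $line = '';
        // A paragraph's text is split across runs; each run keeps it in w:t
        foreach ($p->getElementsByTagNameNS(WORDML_NS, 't') as $t) {
            $line .= $t->textContent;
        }
        $paragraphs[] = $line;
    }
    return implode("\n", $paragraphs);
}

/**
 * Read a .docx file and return its text, or null if the archive
 * cannot be opened or has no word/document.xml entry.
 */
function readDocxText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? null : documentXmlToText($xml);
}
```

Parsing with a namespace-aware DOM instead of strip_tags() keeps paragraph boundaries and avoids gluing adjacent runs to unrelated markup, at the cost of loading the whole document part into memory.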
www.phplivedocx.org is a SOAP-based service, which means you always need to be online to test your files, and it does not have enough usage examples. I only discovered after two days of downloading (it also requires Zend Framework) that it is a SOAP-based program. I think that without COM this is just not possible on a Linux server, and the only option is to convert the .doc file into another format that PHP can parse.
answered Sep 13, 2009 at 17:45
Source gotten from
Use following class directly to read word document
class DocxConversion{
private $filename;
public function __construct($filePath) {
$this->filename = $filePath;
}
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = @fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
return $outtext;
}
private function read_docx(){
$striped_content = '';
$content = '';
$zip = zip_open($this->filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
/************************excel sheet************************************/
function xlsx_to_text($input_file){
$xml_filename = "xl/sharedStrings.xml"; //content file name
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
// loadXML() is an instance method; calling it statically fails on PHP 8
$xml_handle = new DOMDocument();
$xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text = strip_tags($xml_handle->saveXML());
}else{
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
/*************************power point files*****************************/
function pptx_to_text($input_file){
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
$slide_number = 1; //loop through slide files
while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
// loadXML() is an instance method; calling it statically fails on PHP 8
$xml_handle = new DOMDocument();
$xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text .= strip_tags($xml_handle->saveXML());
$slide_number++;
}
if($slide_number == 1){
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
public function convertToText() {
if(isset($this->filename) && !file_exists($this->filename)) {
return "File Not exists";
}
$fileArray = pathinfo($this->filename);
$file_ext = $fileArray['extension'];
if($file_ext == "doc" || $file_ext == "docx" || $file_ext == "xlsx" || $file_ext == "pptx")
{
if($file_ext == "doc") {
return $this->read_doc();
} elseif($file_ext == "docx") {
return $this->read_docx();
} elseif($file_ext == "xlsx") {
return $this->xlsx_to_text($this->filename);
}elseif($file_ext == "pptx") {
return $this->pptx_to_text($this->filename);
}
} else {
return "Invalid File Type";
}
}
}
$docObj = new DocxConversion("test.docx"); //replace your document name with correct extension doc or docx
echo $docText= $docObj->convertToText();
answered Jul 3, 2019 at 10:25
Office 2007 .docx should be possible since it’s an XML standard. Word 2003 most likely requires COM to read, even with the standards now published by MS, since those standards are huge. I haven’t seen many libraries written to match them yet.
answered Oct 10, 2008 at 2:45
acrosman
I don't know what you are going to use it for, but I needed .doc support for search indexing. What I did was use a little command-line tool called "catdoc", which converts the contents of the Word document to plain text so it can be indexed. If you need to keep formatting and so on, this is not your tool.
answered Oct 10, 2008 at 15:25
fijter
phpLiveDocx is a Zend Framework component and can read and write DOC and DOCX files in PHP on Linux, Windows and Mac.
See the project web site at:
answered May 14, 2009 at 7:03
One way to manipulate Word files with PHP that you may find interesting is with the help of PHPDocX.
You may see how it works having a look at its online tutorial.
You can insert or extract contents or even merge multiple Word files into a single one.
answered Sep 28, 2012 at 16:44
Would the .rtf format work for your purposes? .rtf can easily be converted to and from .doc format, but it is written in plaintext (with control commands embedded). This is how I plan to integrate my application with Word documents.
answered Jan 24, 2009 at 5:09
Josh Smeaton
I'm working on the same kind of project (an online word processor), but I chose C#/.NET and ASP.NET. From the survey I did, I learned that with the Open XML SDK and VSTO (Visual Studio Tools for Office) we can easily work with a Word file, manipulate it, and even convert it internally into several formats such as .odt, .pdf, .docx, etc.
So go to msdn.microsoft.com and study the Office development tab thoroughly. It's the easiest way to do this, as all the functions we need to implement are already available in .NET.
But as you want to do your project in PHP, you can do it in Visual Studio and .NET, as PHP is also one of the .NET-compliant languages.
answered Sep 5, 2010 at 14:17
Noddy Cha
I have the same case.
I guess I am going to use a cheap 50 MB Windows-based hosting plan with a free domain to convert my files for the PHP server. Linking them is easy: all you need is an ASP.NET page that receives the .doc file via POST and returns the result over HTTP, so a simple cURL request would do it.
answered Oct 11, 2010 at 19:12
//For DOCX. If you want to preserve white space and also handle table rows (tr) and cells (tc), use the code below and modify it to your taste. It downloads the file from a remote or local URL.
//=========DOCX===========
function extractDocxText($url,$file_name){
$docx = get_url($url);
file_put_contents("tempf.docx",$docx);
$xml_filename = "word/document.xml"; //content file name
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open("tempf.docx")){
if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
//file_put_contents($input_file.".xml",$xml_datas);
$replace_newlines = preg_replace('/<w:p w[0-9-Za-z]+:[a-zA-Z0-9]+="[a-zA-z"0-9 :="]+">/',"\r\n",$xml_datas);
$replace_tableRows = preg_replace('/<w:tr>/',"\r\n",$replace_newlines);
$replace_tab = preg_replace('/<w:tab\/>/',"\t",$replace_tableRows);
$replace_paragraphs = preg_replace('/<\/w:p>/',"\r\n",$replace_tab);
$replace_other_Tags = strip_tags($replace_paragraphs);
$output_text = $replace_other_Tags;
}else{
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .=" ";
}
chmod("tempf.docx", 0777); unlink(realpath("tempf.docx"));
//save to file or echo content
file_put_contents($file_name,$output_text);
echo $output_text;
}
//========PDF===========
//Requires installation in your Linux server
//sudo su
//apt-get install xpdf
function extractPdfText($url,$PDF_fullpath_or_Filename){
$pdf = get_url($url);
file_put_contents ("temppdf.txt", $pdf);
$content = pdf2text("temppdf.txt");
chmod("temppdf.txt", 0777); unlink(realpath("temppdf.txt"));
echo $content;
file_put_contents($PDF_fullpath_or_Filename,$content);
}
//========DOC==========
function extractDocText($url,$file_name){
$doc = get_url($url);
file_put_contents ("tempf.txt", $doc);
$fileHandle = fopen("tempf.txt", "r");
$line = @fread($fileHandle, filesize("tempf.txt"));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline){
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{} else {$outtext .= $thisline."\r\n";}
}
$content = preg_replace('/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/',' ',$outtext);
//chmod("tempf.txt", 0777); unlink(realpath("tempf.txt"));
echo $content;
file_put_contents($file_name,$content);
}
//========XLSX==========
function extractXlsxText($url,$file_name){
$xlsx = get_url($url);
file_put_contents ("tempf.txt", $xlsx);
$content = "";
$dir = 'tempforxlsx';
// Unzip
$zip = new ZipArchive();
$zip->open("tempf.txt");
$zip->extractTo($dir);
// Open up shared strings & the first worksheet
$strings = simplexml_load_file($dir . '/xl/sharedStrings.xml');
$sheet = simplexml_load_file($dir . '/xl/worksheets/sheet1.xml');
// Parse the rows
$xlrows = $sheet->sheetData->row;
foreach ($xlrows as $xlrow) {
$arr = array();
// In each row, grab it's value
foreach ($xlrow->c as $cell) {
$v = (string) $cell->v;
// If it has a "t" (type?) of "s" (string?), use the value to look up string value
if (isset($cell['t']) && $cell['t'] == 's') {
$s = array();
$si = $strings->si[(int) $v];
// Register & alias the default namespace or you'll get empty results in the xpath query
$si->registerXPathNamespace('n', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main');
// Cat together all of the 't' (text?) node values
foreach($si->xpath('.//n:t') as $t) {
$content .= $t." ";} }
}
}
echo $content;
file_put_contents($file_name,$content);
}
//========PPT==========
function extractPptText($url,$file_name){
$ppt = file_get_contents($url);
file_put_contents ("tempf.ppt", $ppt);
$fileHandle = fopen("tempf.ppt", "r");
$line = @fread($fileHandle, filesize("tempf.ppt"));
$lines = explode(chr(0x0f),$line);
$outtext = '';
foreach($lines as $thisline) {
if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) == 1) {
$text_line = substr($thisline, 4);
$end_pos = strpos($text_line, chr(0x00));
$text_line = substr($text_line, 0, $end_pos);
$text_line = preg_replace('/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/'," ",$text_line);
$outtext = substr($text_line, 0, $end_pos)."\n".$outtext;
}
}
//echo $outtext;
file_put_contents($file_name,$outtext);
}
//========PPTX==========
function extractPptxText($url,$file_name){
$xls = get_url($url);
file_put_contents ("tempf.txt", $xls);
$zip_handle = new ZipArchive;
$output_text = ' ';
if(true === $zip_handle->open("tempf.txt")){
$slide_number = 1; //loop through slide files
while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index); // these four lines of codes
// below were
$xml_handle = new DOMDocument (); // added by me in order
$xml_handle->preserveWhiteSpace = true; // to preserve space between
$xml_handle->formatOutput = true; // each read data
$xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text .= $xml_handle->saveXML();
$slide_number++;
}
if($slide_number == 1){
$output_text .= "";
}
$zip_handle->close();
}else{
$output_text .= "";
}
echo $output_text;
file_put_contents($file_name,$output_text);
}
/*
==========================================================================
=========================================================================
And below is the get_url() function, a more configurable alternative to file_get_contents():
*/
function get_url( $url,$timeout = 5 )
{
$url = str_replace( "&amp;", "&", urldecode(trim($url)) );
$ch = curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_ENCODING, "" );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); # required for https urls
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
$content = curl_exec( $ch );
//$response = curl_getinfo( $ch );
curl_close ( $ch );
return $content;
}
How to read and view docx files using PHP. Processing Word documents with PHP is becoming more popular, and you can even create a new Word document and work with it from code. My previous article describes how to create a Word document using PHP.
Today we are going to discuss reading docx files, converting them to text, and viewing the result online. Let's begin with the steps and the code.
<?php
function kv_read_word($input_file){
$kv_strip_texts = '';
$kv_texts = '';
if(!$input_file || !file_exists($input_file)) return false;
$zip = zip_open($input_file);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$kv_texts .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}
zip_close($zip);
$kv_texts = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $kv_texts);
$kv_texts = str_replace('</w:r></w:p>', "\r\n", $kv_texts);
$kv_strip_texts = nl2br(strip_tags($kv_texts, ''));
return $kv_strip_texts;
}
?>
The above function parses the text in a Word document and returns it.
Now pass the path of the input file to the function and print the result.
<?php
$kv_texts = kv_read_word('path/to/the/file/kvcodes.docx');
if($kv_texts !== false) {
echo nl2br($kv_texts);
}
else {
echo "Can't read that file.";
}
?>
That's all it takes to read a docx file and print it as text.
I have another article for WordPress users who want to process docx files using PHP and WordPress:
How to Read and get Texts from Docx Files in WordPress
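The zip_open() family of functions used in kv_read_word() was deprecated in PHP 8.0, so on newer PHP versions the same idea can be sketched with ZipArchive. The following is a minimal sketch, not the article author's code: the function names docx_xml_to_text() and read_docx_text() are made up for this example, and the tag handling is the same rough string replacement as above rather than real XML parsing.

```php
<?php
// Hypothetical helper: turn the raw XML of word/document.xml into plain text.
function docx_xml_to_text(string $xml): string
{
    // A table cell boundary becomes a space, a paragraph end becomes a newline.
    $xml = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $xml);
    $xml = str_replace('</w:p>', "\n", $xml);
    // Drop the remaining tags and decode XML entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return trim($text);
}

// Hypothetical helper: read word/document.xml from a .docx archive.
// Returns the extracted text, or false if the file cannot be read.
function read_docx_text(string $path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? false : docx_xml_to_text($xml);
}
```

Calling read_docx_text('path/to/the/file/kvcodes.docx') would then take the place of kv_read_word() in the usage example above. Keeping the text conversion in its own function makes it possible to check it on an XML string without a real document.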
Contents
- Introduction
- Features
- File formats
- Installing/configuring
- Requirements
- Installation
- Using samples
- General usage
- Basic example
- Settings
- Default font
- Document properties
- Measurement units
- Containers
- Sections
- Headers
- Footers
- Other containers
- Elements
- Texts
- Breaks
- Lists
- Tables
- Images
- Objects
- Table of contents
- Footnotes & endnotes
- Checkboxes
- Textboxes
- Fields
- Lines
- Shapes
- Charts
- FormFields
- Styles
- Section
- Font
- Paragraph
- Table
- Templates processing
- Writers & readers
- OOXML
- OpenDocument
- RTF
- HTML
- Recipes
- Frequently asked questions
- References
Introduction
PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), and Rich Text Format (RTF).
PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord is aimed to be a high quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading this Developers’ Documentation and the API Documentation.
Features
- Set document properties, e.g. title, subject, and creator.
- Create document sections with different settings, e.g. portrait/landscape, page size, and page numbering
- Create header and footer for each section
- Set default font type, font size, and paragraph style
- Use UTF-8 and East Asia fonts/characters
- Define custom font styles (e.g. bold, italic, color) and paragraph styles (e.g. centered, multicolumns, spacing) either as named style or inline in text
- Insert paragraphs, either as a simple text or complex one (a text run) that contains other elements
- Insert titles (headers) and table of contents
- Insert text breaks and page breaks
- Insert and format images, either local, remote, or as page watermarks
- Insert binary OLE Objects such as Excel or Visio
- Insert and format table with customized properties for each rows (e.g. repeat as header row) and cells (e.g. background color, rowspan, colspan)
- Insert list items as bulleted, numbered, or multilevel
- Insert hyperlinks
- Insert footnotes and endnotes
- Insert drawing shapes (arc, curve, line, polyline, rect, oval)
- Insert charts (pie, doughnut, bar, line, area, scatter, radar)
- Insert form fields (textinput, checkbox, and dropdown)
- Create document from templates
- Use XSL 1.0 style sheets to transform main document part of OOXML template
- … and many more features in progress
File formats
Below are the supported features for each file format.
Writers
| Features | DOCX | ODT | RTF | HTML | ||
|---|---|---|---|---|---|---|
| Document Properties | Standard | ✓ | ✓ | ✓ | ✓ | |
| Custom | ✓ | ✓ | ||||
| Element Type | Text | ✓ | ✓ | ✓ | ✓ | ✓ |
| Text Run | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Title | ✓ | ✓ | ✓ | ✓ | ||
| Link | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Preserve Text | ✓ | |||||
| Text Break | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Page Break | ✓ | ✓ | ||||
| List | ✓ | |||||
| Table | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Image | ✓ | ✓ | ✓ | ✓ | ||
| Object | ✓ | |||||
| Watermark | ✓ | |||||
| Table of Contents | ✓ | |||||
| Header | ✓ | |||||
| Footer | ✓ | |||||
| Footnote | ✓ | ✓ | ||||
| Endnote | ✓ | ✓ | ||||
| Graphs | 2D basic graphs | ✓ | ||||
| 2D advanced graphs | ||||||
| 3D graphs | ✓ | |||||
| Math | OMML support | |||||
| MathML support | ||||||
| Bonus | Encryption | |||||
| Protection |
Readers
| Features | DOCX | ODT | RTF | HTML | |
|---|---|---|---|---|---|
| Document Properties | Standard | ✓ | |||
| Custom | ✓ | ||||
| Element Type | Text | ✓ | ✓ | ✓ | ✓ |
| Text Run | ✓ | ||||
| Title | ✓ | ✓ | |||
| Link | ✓ | ||||
| Preserve Text | ✓ | ||||
| Text Break | ✓ | ||||
| Page Break | ✓ | ||||
| List | ✓ | ✓ | ✓ | ||
| Table | ✓ | ✓ | |||
| Image | ✓ | ||||
| Object | |||||
| Watermark | |||||
| Table of Contents | |||||
| Header | ✓ | ||||
| Footer | ✓ | ||||
| Footnote | ✓ | ||||
| Endnote | ✓ | ||||
| Graphs | 2D basic graphs | ||||
| 2D advanced graphs | |||||
| 3D graphs | |||||
| Math | OMML support | ||||
| MathML support | |||||
| Bonus | Encryption | ||||
| Protection |
Contributing
We welcome everyone to contribute to PHPWord. Below are some of the things that you can do to contribute:
- Read our contributing guide
- Fork us and request a pull to the develop branch
- Submit bug reports or feature requests to GitHub
- Follow @PHPWord and @PHPOffice on Twitter
Installing/configuring
Requirements
Mandatory:
- PHP 5.3+
- PHP Zip extension
- PHP XML Parser extension
Optional PHP extensions:
- GD
- XMLWriter
- XSL
Installation
There are two ways to install PHPWord, i.e. via Composer or manually by downloading the library.
Using Composer
To install via Composer, add the following lines to your composer.json:
{
"require": {
"phpoffice/phpword": "dev-master"
}
}
Manual install
To install manually, download the PHPWord package from GitHub. Extract the package and put the contents on your machine. To use the library, include src/PhpWord/Autoloader.php in your script and invoke Autoloader::register.
require_once '/path/to/src/PhpWord/Autoloader.php';
\PhpOffice\PhpWord\Autoloader::register();
Using samples
After installation, you can browse and use the samples that we’ve provided, either by command line or using browser. If you can access your PHPWord library folder using browser, point your browser to the samples folder, e.g. http://localhost/PhpWord/samples/.
General usage
Basic example
The following is a basic example of the PHPWord library. More examples are provided in the samples folder.
<?php
require_once 'src/PhpWord/Autoloader.php';
\PhpOffice\PhpWord\Autoloader::register();

// Creating the new document...
$phpWord = new \PhpOffice\PhpWord\PhpWord();

/* Note: any element you append to a document must reside inside of a Section. */

// Adding an empty Section to the document...
$section = $phpWord->addSection();

// Adding Text element to the Section having font styled by default...
$section->addText(
    htmlspecialchars(
        '"Learn from yesterday, live for today, hope for tomorrow. '
            . 'The important thing is not to stop questioning." '
            . '(Albert Einstein)'
    )
);

/*
 * Note: it's possible to customize font style of the Text element you add in three ways:
 * - inline;
 * - using named font style (new font style object will be implicitly created);
 * - using explicitly created font style object.
 */

// Adding Text element with font customized inline...
$section->addText(
    htmlspecialchars(
        '"Great achievement is usually born of great sacrifice, '
            . 'and is never the result of selfishness." '
            . '(Napoleon Hill)'
    ),
    array('name' => 'Tahoma', 'size' => 10)
);

// Adding Text element with font customized using named font style...
$fontStyleName = 'oneUserDefinedStyle';
$phpWord->addFontStyle(
    $fontStyleName,
    array('name' => 'Tahoma', 'size' => 10, 'color' => '1B2232', 'bold' => true)
);
$section->addText(
    htmlspecialchars(
        '"The greatest accomplishment is not in never falling, '
            . 'but in rising again after you fall." '
            . '(Vince Lombardi)'
    ),
    $fontStyleName
);

// Adding Text element with font customized using explicitly created font style object...
$fontStyle = new \PhpOffice\PhpWord\Style\Font();
$fontStyle->setBold(true);
$fontStyle->setName('Tahoma');
$fontStyle->setSize(13);
$myTextElement = $section->addText(
    htmlspecialchars('"Believe you can and you\'re halfway there." (Theodor Roosevelt)')
);
$myTextElement->setFontStyle($fontStyle);

// Saving the document as OOXML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$objWriter->save('helloWorld.docx');

// Saving the document as ODF file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'ODText');
$objWriter->save('helloWorld.odt');

// Saving the document as HTML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$objWriter->save('helloWorld.html');

/* Note: we skip RTF, because it's not XML-based and requires a different example. */
/* Note: we skip PDF, because "HTML-to-PDF" approach is used to create PDF documents. */
Settings
The \PhpOffice\PhpWord\Settings class provides some options that affect the behavior of PHPWord. Below are the options.
XML Writer compatibility
This option sets XMLWriter::setIndent and XMLWriter::setIndentString. The default value of this option is true (compatible), which is required for OpenOffice to render OOXML document correctly. You can set this option to false during development to make the resulting XML file easier to read.
\PhpOffice\PhpWord\Settings::setCompatibility(false);
Zip class
By default, PHPWord uses PHP ZipArchive to read or write ZIP compressed archives and the files inside them. If you can't have ZipArchive installed on your server, you can use the pure PHP alternative, PCLZip, which is included with PHPWord.
\PhpOffice\PhpWord\Settings::setZipClass(\PhpOffice\PhpWord\Settings::PCLZIP);
Default font
By default, every text appears in Arial 10 point. You can alter the default font by using the following two functions:
$phpWord->setDefaultFontName('Times New Roman');
$phpWord->setDefaultFontSize(12);
Document information
You can set the document information such as title, creator, and company name. Use the following functions:
$properties = $phpWord->getDocInfo();
$properties->setCreator('My name');
$properties->setCompany('My factory');
$properties->setTitle('My title');
$properties->setDescription('My description');
$properties->setCategory('My category');
$properties->setLastModifiedBy('My name');
$properties->setCreated(mktime(0, 0, 0, 3, 12, 2014));
$properties->setModified(mktime(0, 0, 0, 3, 14, 2014));
$properties->setSubject('My subject');
$properties->setKeywords('my, key, word');
Measurement units
The base length unit in Office Open XML is the twip. Twip means "TWentieth of an Inch Point", i.e. 1 twip = 1/1440 inch.
You can use PHPWord helper functions to convert inches, centimeters, or points to twips.
// Paragraph with 6 points space after
$phpWord->addParagraphStyle('My Style', array(
    'spaceAfter' => \PhpOffice\PhpWord\Shared\Converter::pointToTwip(6))
);

$section = $phpWord->addSection();
$sectionStyle = $section->getStyle();
// half inch left margin
$sectionStyle->setMarginLeft(\PhpOffice\PhpWord\Shared\Converter::inchToTwip(.5));
// 2 cm right margin
$sectionStyle->setMarginRight(\PhpOffice\PhpWord\Shared\Converter::cmToTwip(2));
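To make the arithmetic behind these conversions concrete, here is a small standalone sketch that does not use PHPWord at all. The function and constant names are invented for this illustration. It relies only on the definitions above (1 inch = 1440 twips) plus the standard relations 1 point = 1/72 inch and 1 inch = 2.54 cm.

```php
<?php
// Illustrative constants, not part of PHPWord.
const TWIPS_PER_INCH = 1440;
const TWIPS_PER_POINT = 20; // 1 point = 1/72 inch, and 1440 / 72 = 20

// Convert inches to twips.
function inch_to_twip(float $inches): float
{
    return $inches * TWIPS_PER_INCH;
}

// Convert typographic points to twips.
function point_to_twip(float $points): float
{
    return $points * TWIPS_PER_POINT;
}

// Convert centimeters to twips by going through inches (1 inch = 2.54 cm).
function cm_to_twip(float $cm): float
{
    return $cm / 2.54 * TWIPS_PER_INCH;
}
```

With these definitions, a half inch margin is 720 twips and a 6 point paragraph spacing is 120 twips. A 2 cm margin comes out at roughly 1134 twips.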
Containers
Containers are objects where you can put elements (texts, lists, tables, etc). There are 3 main containers, i.e. sections, headers, and footers. There are 3 elements that can also act as containers, i.e. textruns, table cells, and footnotes.
Sections
Every visible element in word is placed inside of a section. To create a section, use the following code:
$section = $phpWord->addSection($sectionStyle);
The $sectionStyle is an optional associative array that sets the style of the section. Example:
$sectionStyle = array(
    'orientation' => 'landscape',
    'marginTop' => 600,
    'colsNum' => 2,
);
Page number
You can change a section page number by using the pageNumberingStart style of the section.
// Method 1
$section = $phpWord->addSection(array('pageNumberingStart' => 1));

// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setPageNumberingStart(1);
Multicolumn
You can change a section layout to multicolumn (like in a newspaper) by using the breakType and colsNum style of the section.
// Method 1
$section = $phpWord->addSection(array('breakType' => 'continuous', 'colsNum' => 2));

// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setBreakType('continuous');
$section->getStyle()->setColsNum(2);
Line numbering
You can apply line numbering to a section by using the lineNumbering style of the section.
// Method 1
$section = $phpWord->addSection(array('lineNumbering' => array()));

// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setLineNumbering(array());
Below are the properties of the line numbering style.
- start: Line numbering starting value
- increment: Line number increments
- distance: Distance between text and line numbering in twip
- restart: Line numbering restart setting continuous|newPage|newSection
Headers
Each section can have its own header reference. To create a header use the addHeader method:
$header = $section->addHeader();
Be sure to save the result in a local object. You can use all elements that are available for the footer. See the "Footers" section for details. Additionally, watermarks or background pictures can only be added inside of the header reference. See the "Watermarks" section.
Footers
Each section can have its own footer reference. To create a footer, use the addFooter method:
$footer = $section->addFooter();
Be sure to save the result in a local object to add elements to a footer. You can add the following elements to footers:
- Texts (addText and createTextrun)
- Text breaks
- Images
- Tables
- Preserve text
See the "Elements" section for the details of each element.
Other containers
Textruns, table cells, and footnotes are elements that can also act as containers. See the corresponding "Elements" section for the details of each element.
Elements
Below is the matrix of element availability in each container. The columns show the containers while the rows list the elements.
| Num | Element | Section | Header | Footer | Cell | Text Run | Footnote |
|---|---|---|---|---|---|---|---|
| 1 | Text | v | v | v | v | v | v |
| 2 | Text Run | v | v | v | v | - | - |
| 3 | Link | v | v | v | v | v | v |
| 4 | Title | v | ? | ? | ? | ? | ? |
| 5 | Preserve Text | ? | v | v | v* | - | - |
| 6 | Text Break | v | v | v | v | v | v |
| 7 | Page Break | v | - | - | - | - | - |
| 8 | List | v | v | v | v | - | - |
| 9 | Table | v | v | v | v | - | - |
| 10 | Image | v | v | v | v | v | v |
| 11 | Watermark | - | v | - | - | - | - |
| 12 | Object | v | v | v | v | v | v |
| 13 | TOC | v | - | - | - | - | - |
| 14 | Footnote | v | - | - | v** | v** | - |
| 15 | Endnote | v | - | - | v** | v** | - |
| 16 | CheckBox | v | v | v | v | - | - |
| 17 | TextBox | v | v | v | v | - | - |
| 18 | Field | v | v | v | v | v | v |
| 19 | Line | v | v | v | v | v | v |
| 20 | Shape | v | v | v | v | v | v |
| 21 | Chart | v | - | - | - | - | - |
| 22 | Form Fields | v | v | v | v | v | v |
Legend:
- v: Available
- v*: Available only when inside header/footer
- v**: Available only when inside section
- -: Not available
- ?: Should be available
Texts
Text can be added by using the addText and addTextRun methods. addText is used for creating simple paragraphs that only contain text with the same style. addTextRun is used for creating complex paragraphs that contain text with different styles (some bold, other italics, etc) or other elements, e.g. images or links. The syntax is as follows:
$section->addText($text, [$fontStyle], [$paragraphStyle]);
$textrun = $section->addTextRun([$paragraphStyle]);
You can use the $fontStyle and $paragraphStyle variable to define text formatting. There are 2 options to style the inserted text elements, i.e. inline style by using array or defined style by adding style definition.
Inline style examples:
$fontStyle = array('name' => 'Times New Roman', 'size' => 9);
$paragraphStyle = array('align' => 'both');
$section->addText('I am simple paragraph', $fontStyle, $paragraphStyle);

$textrun = $section->addTextRun();
$textrun->addText('I am bold', array('bold' => true));
$textrun->addText('I am italic', array('italic' => true));
$textrun->addText('I am colored', array('color' => 'AACC00'));
Defined style examples:
$fontStyle = array('color' => '006699', 'size' => 18, 'bold' => true);
$phpWord->addFontStyle('fStyle', $fontStyle);
$text = $section->addText('Hello world!', 'fStyle');

$paragraphStyle = array('align' => 'center');
$phpWord->addParagraphStyle('pStyle', $paragraphStyle);
// The paragraph style is the third argument, so the font style is passed as null.
$text = $section->addText('Hello world!', null, 'pStyle');
Titles
If you want to structure your document or build table of contents, you need titles or headings. To add a title to the document, use the addTitleStyle and addTitle method.
$phpWord->addTitleStyle($depth, [$fontStyle], [$paragraphStyle]);
$section->addTitle($text, [$depth]);
It's necessary to add a title style to your document because otherwise the title won't be detected as a real title.
Links
You can add Hyperlinks to the document by using the function addLink:
$section->addLink($linkSrc, [$linkName], [$fontStyle], [$paragraphStyle]);
- $linkSrc: The URL of the link.
- $linkName: Placeholder of the URL that appears in the document.
- $fontStyle: See "Font style" section.
- $paragraphStyle: See "Paragraph style" section.
Preserve texts
The addPreserveText method is used to add a page number or page count to headers or footers.
$footer->addPreserveText('Page {PAGE} of {NUMPAGES}.');
Breaks
Text breaks
Text breaks are empty new lines. To add text breaks, use the following syntax. All parameters are optional.
$section->addTextBreak([$breakCount], [$fontStyle], [$paragraphStyle]);
- $breakCount: How many lines
- $fontStyle: See "Font style" section.
- $paragraphStyle: See "Paragraph style" section.
Page breaks
There are two ways to insert a page break: using the addPageBreak method or using the pageBreakBefore style of a paragraph.
$section->addPageBreak();
Lists
To add a list item use the function addListItem.
Basic usage:
$section->addListItem($text, [$depth], [$fontStyle], [$listStyle], [$paragraphStyle]);
Parameters:
- $text: Text that appears in the document.
- $depth: Depth of list item.
- $fontStyle: See "Font style" section.
- $listStyle: List style of the current element TYPE_NUMBER, TYPE_ALPHANUM, TYPE_BULLET_FILLED, etc. See list of constants in PHPWord_Style_ListItem.
- $paragraphStyle: See "Paragraph style" section.
Advanced usage:
You can also create your own numbering style by changing the $listStyle parameter with the name of your numbering style.
$phpWord->addNumberingStyle(
    'multilevel',
    array('type' => 'multilevel', 'levels' => array(
        array('format' => 'decimal', 'text' => '%1.', 'left' => 360, 'hanging' => 360, 'tabPos' => 360),
        array('format' => 'upperLetter', 'text' => '%2.', 'left' => 720, 'hanging' => 360, 'tabPos' => 720),
        )
    )
);
$section->addListItem('List Item I', 0, null, 'multilevel');
$section->addListItem('List Item I.a', 1, null, 'multilevel');
$section->addListItem('List Item I.b', 1, null, 'multilevel');
$section->addListItem('List Item II', 0, null, 'multilevel');
Tables
To add tables, rows, and cells, use the addTable, addRow, and addCell methods:
$table = $section->addTable([$tableStyle]);
$table->addRow([$height], [$rowStyle]);
$cell = $table->addCell($width, [$cellStyle]);
Table style can be defined with addTableStyle:
$tableStyle = array(
    'borderColor' => '006699',
    'borderSize' => 6,
    'cellMargin' => 50
);
$firstRowStyle = array('bgColor' => '66BBFF');
$phpWord->addTableStyle('myTable', $tableStyle, $firstRowStyle);
$table = $section->addTable('myTable');
Cell span
You can span a cell on multiple columns by using gridSpan or multiple rows by using vMerge.
$cell = $table->addCell(200);
$cell->getStyle()->setGridSpan(5);
See Sample_09_Tables.php for more code samples.
Images
To add an image, use the addImage method to sections, headers, footers, textruns, or table cells.
$section->addImage($src, [$style]);
- source: String path to a local image or URL of a remote image
- styles: Array of styles for the image. See below.
Examples:
$section = $phpWord->addSection();
$section->addImage(
    'mars.jpg',
    array(
        'width' => 100,
        'height' => 100,
        'marginTop' => -1,
        'marginLeft' => -1,
        'wrappingStyle' => 'behind'
    )
);
$footer = $section->addFooter();
$footer->addImage('http://example.com/image.php');
$textrun = $section->addTextRun();
$textrun->addImage('http://php.net/logo.jpg');
Watermarks
To add a watermark (or page background image), your section needs a header reference. After creating a header, you can use the addWatermark method to add a watermark.
$section = $phpWord->addSection();
$header = $section->addHeader();
$header->addWatermark('resources/_earth.jpg', array('marginTop' => 200, 'marginLeft' => 55));
Objects
You can add OLE embeddings, such as Excel spreadsheets or PowerPoint presentations, to the document by using the addObject method.
$section->addObject($src, [$style]);
Table of contents
To add a table of contents (TOC), you can use the addTOC method. Your TOC can only be generated if you have added at least one title (see "Titles").
$section->addTOC([$fontStyle], [$tocStyle], [$minDepth], [$maxDepth]);
- $fontStyle: See font style section
- $tocStyle: See available options below
- $minDepth: Minimum depth of header to be shown. Default 1
- $maxDepth: Maximum depth of header to be shown. Default 9
Options for $tocStyle:
- tabLeader: Fill type between the title text and the page number. Use the defined constants in PHPWord_Style_TOC.
- tabPos: The position of the tab where the page number appears in twips.
- indent: The indent factor of the titles in twips.
Footnotes & endnotes
You can create footnotes with addFootnote and endnotes with addEndnote in texts or textruns, but it’s recommended to use textrun to have better layout. You can use addText, addLink, addTextBreak, addImage, addObject on footnotes and endnotes.
On textrun:
$textrun = $section->addTextRun();
$textrun->addText('Lead text.');
$footnote = $textrun->addFootnote();
$footnote->addText('Footnote text can have ');
$footnote->addLink('http://test.com', 'links');
$footnote->addText('.');
$footnote->addTextBreak();
$footnote->addText('And text break.');
$textrun->addText('Trailing text.');
$endnote = $textrun->addEndnote();
$endnote->addText('Endnote put at the end');
On text:
$section->addText('Lead text.');
$footnote = $section->addFootnote();
$footnote->addText('Footnote text.');
The footnote reference number is displayed as a decimal number starting from 1. This number uses the FooterReference style, which you can redefine with the addFontStyle method. The default value for this style is array('superScript' => true);
Checkboxes
Checkbox elements can be added to sections or table cells by using addCheckBox.
$section->addCheckBox($name, $text, [$fontStyle], [$paragraphStyle])
- $name: Name of the check box.
- $text: Text following the check box
- $fontStyle: See "Font style" section.
- $paragraphStyle: See "Paragraph style" section.
Textboxes
To be completed.
Fields
To be completed.
Lines
To be completed.
Shapes
To be completed.
Charts
To be completed.
Form fields
To be completed.
Styles
Section
Below are the available styles for section:
- orientation: Page orientation, i.e. 'portrait' (default) or 'landscape'
- marginTop: Page margin top in twips
- marginLeft: Page margin left in twips
- marginRight: Page margin right in twips
- marginBottom: Page margin bottom in twips
- borderTopSize: Border top size in twips
- borderTopColor: Border top color
- borderLeftSize: Border left size in twips
- borderLeftColor: Border left color
- borderRightSize: Border right size in twips
- borderRightColor: Border right color
- borderBottomSize: Border bottom size in twips
- borderBottomColor: Border bottom color
- headerHeight: Spacing to top of header
- footerHeight: Spacing to bottom of footer
- gutter: Page gutter spacing
- colsNum: Number of columns
- colsSpace: Spacing between columns
- breakType: Section break type (nextPage, nextColumn, continuous, evenPage, oddPage)
The following two styles are automatically set by the use of the orientation style. You can alter them but that’s not recommended.
- pageSizeW: Page width in twips
- pageSizeH: Page height in twips
Font
Available font styles:
- name: Font name, e.g. Arial
- size: Font size, e.g. 20, 22
- hint: Font content type, default, eastAsia, or cs
- bold: Bold, true or false
- italic: Italic, true or false
- superScript: Superscript, true or false
- subScript: Subscript, true or false
- underline: Underline, dash, dotted, etc.
- strikethrough: Strikethrough, true or false
- doubleStrikethrough: Double strikethrough, true or false
- color: Font color, e.g. FF0000
- fgColor: Font highlight color, e.g. yellow, green, blue
- bgColor: Font background color, e.g. FF0000
- smallCaps: Small caps, true or false
- allCaps: All caps, true or false
Paragraph
Available paragraph styles:
- align: Paragraph alignment, left, right or center
- spaceBefore: Space before paragraph
- spaceAfter: Space after paragraph
- indent: Indent by how much
- hanging: Hanging by how much
- basedOn: Parent style
- next: Style for next paragraph
- widowControl: Allow first/last line to display on a separate page, true or false
- keepNext: Keep paragraph with next paragraph, true or false
- keepLines: Keep all lines on one page, true or false
- pageBreakBefore: Start paragraph on next page, true or false
- lineHeight: Text line height, e.g. 1.0, 1.5, etc.
- tabs: Set of custom tab stops
Table
Table styles:
- width: Table width in percent
- bgColor: Background color, e.g. '9966CC'
- border(Top|Right|Bottom|Left)Size: Border size in twips
- border(Top|Right|Bottom|Left)Color: Border color, e.g. '9966CC'
- cellMargin(Top|Right|Bottom|Left): Cell margin in twips
Row styles:
- tblHeader: Repeat table row on every new page, true or false
- cantSplit: Table row cannot break across pages, true or false
- exactHeight: Row height is exact or at least
Cell styles:
- width: Cell width in twips
- valign: Vertical alignment, top, center, both, bottom
- textDirection: Direction of text
- bgColor: Background color, e.g. '9966CC'
- border(Top|Right|Bottom|Left)Size: Border size in twips
- border(Top|Right|Bottom|Left)Color: Border color, e.g. '9966CC'
- gridSpan: Number of columns spanned
- vMerge: restart or continue
Image
Available image styles:
- width: Width in pixels
- height: Height in pixels
- align: Image alignment, left, right, or center
- marginTop: Top margin in inches, can be negative
- marginLeft: Left margin in inches, can be negative
- wrappingStyle: Wrapping style, inline, square, tight, behind, or infront
Numbering level
- start: Starting value
- format: Numbering format bullet|decimal|upperRoman|lowerRoman|upperLetter|lowerLetter
- restart: Restart numbering level symbol
- suffix: Content between numbering symbol and paragraph text tab|space|nothing
- text: Numbering level text, e.g. %1 for nonbullet or bullet character
- align: Numbering symbol align left|center|right|both
- left: See paragraph style
- hanging: See paragraph style
- tabPos: See paragraph style
- font: Font name
- hint: See font style
Templates processing
You can create a .docx document template with included search-patterns which can be replaced by any value you wish. Only single-line values can be replaced.
To work with a template file, create an instance with the new TemplateProcessor statement. When the TemplateProcessor instance is created, the document template is copied into the temporary directory. You can then use the TemplateProcessor::setValue method to change the value of a search pattern. The search-pattern model is: ${search-pattern}.
Example:
$templateProcessor = new TemplateProcessor('Template.docx');
$templateProcessor->setValue('Name', 'Somebody someone');
$templateProcessor->setValue('Street', 'Coming-Undone-Street 32');
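The ${search-pattern} model itself can be illustrated on a plain string. The sketch below is only an analogy and makes no claim about how TemplateProcessor works internally (the real class operates on the XML parts inside the .docx archive). The helper name fill_placeholders() is invented for this example.

```php
<?php
// Hypothetical helper: replace ${name} placeholders in a plain string.
// Single-quoted strings are used so that PHP does not try to interpolate "${...}".
function fill_placeholders(string $template, array $values): string
{
    foreach ($values as $name => $value) {
        $template = str_replace('${' . $name . '}', (string) $value, $template);
    }
    return $template;
}
```

For instance, filling 'Dear ${Name}, ${Street}' with the two values from the example above yields 'Dear Somebody someone, Coming-Undone-Street 32'. Placeholders without a matching value are left in place.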
It is not possible to directly add new OOXML elements to the template file being processed, but it is possible to transform main document part of the template using XSLT (see TemplateProcessor::applyXslStyleSheet).
See Sample_07_TemplateCloneRow.php for an example of how to create multiple rows from a single row in a template by using TemplateProcessor::cloneRow.
See Sample_23_TemplateBlock.php for an example of how to clone a block of text using TemplateProcessor::cloneBlock and delete a block of text using TemplateProcessor::deleteBlock.
Writers & readers
OOXML
The package of an OOXML document consists of the following files.
- _rels/
  - .rels
- docProps/
  - app.xml
  - core.xml
  - custom.xml
- word/
  - rels/
    - document.rels.xml
  - media/
  - theme/
    - theme1.xml
  - document.xml
  - fontTable.xml
  - numbering.xml
  - settings.xml
  - styles.xml
  - webSettings.xml
- [Content_Types].xml
OpenDocument
Package
The package of an OpenDocument document consists of the following files.
- META-INF/
  - manifest.xml
- Pictures/
- content.xml
- meta.xml
- styles.xml
content.xml
The structure of content.xml is described below.
- office:document-content
  - office:font-facedecls
  - office:automatic-styles
  - office:body
    - office:text
      - draw:*
      - office:forms
      - table:table
      - text:list
      - text:numbered-paragraph
      - text:p
      - text:table-of-contents
      - text:section
    - office:chart
    - office:image
    - office:drawing
styles.xml
The structure of styles.xml is described below.
- office:document-styles
  - office:styles
  - office:automatic-styles
  - office:master-styles
    - office:master-page
RTF
To be completed.
HTML
To be completed.
To be completed.
Recipes
Create float left image
Use absolute positioning relative to margin horizontally and to line vertically.
$imageStyle = array(
    'width' => 40,
    'height' => 40,
    'wrappingStyle' => 'square',
    'positioning' => 'absolute',
    'posHorizontalRel' => 'margin',
    'posVerticalRel' => 'line',
);
$textrun->addImage('resources/_earth.jpg', $imageStyle);
$textrun->addText($lipsumText);
Download the produced file automatically
Use php://output as the filename.
$phpWord = new \PhpOffice\PhpWord\PhpWord();
$section = $phpWord->addSection();
$section->addText('Hello World!');
$file = 'HelloWorld.docx';
header("Content-Description: File Transfer");
header('Content-Disposition: attachment; filename="' . $file . '"');
header('Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document');
header('Content-Transfer-Encoding: binary');
header('Cache-Control: must-revalidate, post-check=0, pre-check=0');
header('Expires: 0');
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$xmlWriter->save("php://output");
Create numbered headings
Define a numbering style and title styles, and match the two styles (with pStyle and numStyle) like below.
$phpWord->addNumberingStyle(
    'hNum',
    array('type' => 'multilevel', 'levels' => array(
        array('pStyle' => 'Heading1', 'format' => 'decimal', 'text' => '%1'),
        array('pStyle' => 'Heading2', 'format' => 'decimal', 'text' => '%1.%2'),
        array('pStyle' => 'Heading3', 'format' => 'decimal', 'text' => '%1.%2.%3'),
        )
    )
);
$phpWord->addTitleStyle(1, array('size' => 16), array('numStyle' => 'hNum', 'numLevel' => 0));
$phpWord->addTitleStyle(2, array('size' => 14), array('numStyle' => 'hNum', 'numLevel' => 1));
$phpWord->addTitleStyle(3, array('size' => 12), array('numStyle' => 'hNum', 'numLevel' => 2));
$section->addTitle('Heading 1', 1);
$section->addTitle('Heading 2', 2);
$section->addTitle('Heading 3', 3);
Add a link within a title
Apply ‘HeadingN’ paragraph style to TextRun or Link. Sample code:
$phpWord = new \PhpOffice\PhpWord\PhpWord();
$phpWord->addTitleStyle(1, array('size' => 16, 'bold' => true));
$phpWord->addTitleStyle(2, array('size' => 14, 'bold' => true));
$phpWord->addFontStyle('Link', array('color' => '0000FF', 'underline' => 'single'));
$section = $phpWord->addSection();
// Textrun
$textrun = $section->addTextRun('Heading1');
$textrun->addText('The ');
$textrun->addLink('https://github.com/PHPOffice/PHPWord', 'PHPWord', 'Link');
// Link
$section->addLink('https://github.com/', 'GitHub', 'Link', 'Heading2');
Remove [Compatibility Mode] text in the MS Word title bar
Use the Metadata\Compatibility\setOoxmlVersion(n) method, where n is the version of Office (14 = Office 2010, 15 = Office 2013).
$phpWord->getCompatibility()->setOoxmlVersion(15);
Frequently asked questions
Is this the same as the PHPWord that I found on CodePlex?
No. This one is much better with tons of new features that you can’t find in PHPWord 0.6.3. The development in CodePlex is halted and switched to GitHub to allow more participation from the crowd. The more the merrier, right?
I’ve been running PHPWord from CodePlex flawlessly, but I can’t use the latest PHPWord from GitHub. Why?
PHPWord requires PHP 5.3+ since 0.8, while PHPWord 0.6.3 from CodePlex can run with PHP 5.2. There’s a lot of new features that we can get from PHP 5.3 and it’s been around since 2009! You should upgrade your PHP version to use PHPWord 0.8+.
References
ISO/IEC 29500, Third edition, 2012-09-01
- Part 1: Fundamentals and Markup Language Reference
- Part 2: Open Packaging Conventions
- Part 3: Markup Compatibility and Extensibility
- Part 4: Transitional Migration Features
Formal specifications
- Oasis OpenDocument Standard Version 1.2
- Rich Text Format (RTF) Specification, version 1.9.1
Other resources
- DocumentFormat.OpenXml.Wordprocessing Namespace on MSDN
Is it possible to read and write Word files (2003 and 2007) in PHP without using a COM object? I know that I can do this:
$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);
but Word will read it as an HTML file, not as a native .doc file.
Answer 1
Reading binary Word documents would require writing a parser according to the published file format specifications for DOC. I don't think that is a realistically feasible solution.
You could use the Microsoft Office XML formats for reading and writing Word files; they are compatible with Word 2003 and 2007. For reading, you have to make sure the Word documents are saved in the correct format (it is called Word 2003 XML-Document in Word 2007). For writing, you just have to follow the openly available XML schema. I have never used this format for writing Office documents from PHP, but I use it for reading an Excel worksheet (naturally saved as XML-Spreadsheet 2003) and displaying its data on a web page. Since the files are plain XML data, it is no problem to navigate within them and figure out how to extract the data you need.
The other option, which works for Word 2007 only (unless the OpenXML file formats are installed in your Word 2003), is to resort to OpenXML. A DOCX file is just a ZIP archive with XML files included. There are many resources on MSDN about the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated, I think; it depends on how much time you are willing to spend.
Perhaps you can have a look at PHPExcel, a library able to write to and read from Excel 2007 files using the OpenXML standard. It would give you an idea of the work involved in reading and writing OpenXML Word documents.
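The "DOCX is just a ZIP archive with XML files" point from this answer can be sketched as follows. This is a minimal illustration, not a full reader: `makeSampleDocx` and `docxParagraphs` are hypothetical helper names, the sample archive contains only a hand-written word/document.xml, and real documents (tables, footnotes, fields) need more handling.

```php
<?php
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Builds a tiny DOCX-like archive so the sketch is runnable on its own.
function makeSampleDocx(string $path): void
{
    $documentXml = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<w:document xmlns:w="' . W_NS . '"><w:body>'
        . '<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
        . '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
        . '</w:body></w:document>';
    $zip = new ZipArchive();
    if ($zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        throw new RuntimeException("Cannot create $path");
    }
    $zip->addFromString('word/document.xml', $documentXml);
    $zip->close();
}

// Returns one string per w:p element, joining the text of its w:t runs.
function docxParagraphs(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return [];
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return [];
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xml, LIBXML_NONET)) {
        return [];
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}

$path = sys_get_temp_dir() . '/sample_' . uniqid() . '.docx';
makeSampleDocx($path);
print_r(docxParagraphs($path));
unlink($path);
```

Querying `w:t` nodes per paragraph keeps paragraph boundaries, which a plain `strip_tags()` over the whole XML would lose.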
Answer 2
This solution works with versions earlier than Office 2007, and it is pure PHP with no COM:
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end of line (chr(13))
to decide where the text is:
- split the file contents into chunks at chr(13)
- discard every chunk that contains a NUL
- stitch the remaining chunks together
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc) {
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE) || (strlen($thisline) == 0)) {
            // skip empty chunks and chunks that contain NUL bytes
        } else {
            $outtext .= $thisline . " ";
        }
    }
    // keep only letters, digits, whitespace and basic punctuation
    $outtext = preg_replace('/[^a-zA-Z0-9\s,.\-\n\r\t@\/_()]/', "", $outtext);
    return $outtext;
}
$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>
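To see what the heuristic above keeps and discards, here is a small sketch that applies the same split-and-filter step to a synthetic byte string instead of a real .doc file. `extractPlainChunks` is a hypothetical name, and the sample bytes are invented for illustration only.

```php
<?php
// Splits on CR (0x0D) and keeps only non-empty chunks without NUL bytes.
function extractPlainChunks(string $bytes): array
{
    $kept = [];
    foreach (explode("\r", $bytes) as $chunk) {
        if ($chunk === '' || strpos($chunk, "\0") !== false) {
            continue;
        }
        $kept[] = $chunk;
    }
    return $kept;
}

// Invented stand-in for a file body: binary header, two text paragraphs,
// and a trailing binary block.
$fake = "\xD0\xCF\x11\xE0\0\0\0\x01\rFirst paragraph\rSecond paragraph\r\0\0\xFF\xFE\0";
print_r(extractPlainChunks($fake));
```

One consequence worth keeping in mind: text stored as UTF-16 contains NUL bytes for Latin characters, so such paragraphs would be discarded by this filter. The approach is best suited to documents whose text is stored as 8-bit characters.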
Answer 3
Just updating the code from the previous answer:
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end of line (chr(13))
to decide where the text is:
- split the file contents into chunks at chr(13)
- discard every chunk that contains a NUL
- stitch the remaining chunks together
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc) {
    $fileHandle = fopen($userDoc, "r");
    $word_text = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $line = "";
    $tam = filesize($userDoc);
    $nulos = 0;
    $caracteres = 0;
    // skip the first 1536 bytes; stop after a run of more than 1996 NUL bytes
    for ($i = 1536; $i < $tam; $i++) {
        $line .= $word_text[$i];
        if ($word_text[$i] === chr(0x00)) {
            $nulos++;
        } else {
            $nulos = 0;
            $caracteres++;
        }
        if ($nulos > 1996) {
            break;
        }
    }
    //echo $caracteres;
    $lines = explode(chr(0x0D), $line);
    //$outtext = "<pre>";
    $outtext = "";
    foreach ($lines as $thisline) {
        $tam = strlen($thisline);
        if (!$tam) {
            continue;
        }
        $new_line = "";
        for ($i = 0; $i < $tam; $i++) {
            $onechar = $thisline[$i];
            if ($onechar > chr(240)) {
                continue;
            }
            if ($onechar >= chr(0x20)) {
                $caracteres++;
                $new_line .= $onechar;
            }
            if ($onechar == chr(0x14)) {
                $new_line .= "</a>";
            }
            if ($onechar == chr(0x07)) {
                $new_line .= "\t";
                if (isset($thisline[$i + 1])) {
                    if ($thisline[$i + 1] == chr(0x07)) {
                        $new_line .= "\n";
                    }
                }
            }
        }
        // replace field codes with a hyperlink
        $new_line = str_replace("HYPERLINK", "<a href=", $new_line);
        $new_line = str_replace('\o', ">", $new_line);
        $new_line .= "\n";
        // image links
        $new_line = str_replace("INCLUDEPICTURE", "<br><img src=", $new_line);
        $new_line = str_replace('\*', "><br>", $new_line);
        $new_line = str_replace("MERGEFORMATINET", "", $new_line);
        $outtext .= nl2br($new_line);
    }
    return $outtext;
}
//$userDoc = "custo.doc";
$userDoc = "Cultura.doc";
$text = parseWord($userDoc);
echo $text;
?>
Answer 4
www.phplivedocx.org is a SOAP-based service, so files are processed online. It also comes with enough examples to show how to use it. I think that without COM this is simply not possible on a Linux server, and the only idea is to convert the doc file into another file that PHP can parse...
Answer 5
Using the Open XML SDK and VSTO [Visual Studio Tools for Office], we can easily work with Word files, manipulate them, and even convert them internally to different formats such as .odt, .pdf, .docx, etc. So go to msdn.microsoft.com and study the Office development tab carefully. It is the easiest way to do this, as all the functions we need to implement are already available in .NET! But since you want to do your project in PHP, you can do it in Visual Studio and .NET, because PHP is also one of the .NET-compliant languages!
Answer 6
Use the following class directly to read a Word document:
class DocxConversion {
    private $filename;

    public function __construct($filePath) {
        $this->filename = $filePath;
    }

    private function read_doc() {
        $fileHandle = fopen($this->filename, "r");
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            $pos = strpos($thisline, chr(0x00));
            if (($pos !== FALSE) || (strlen($thisline) == 0)) {
                // skip empty chunks and chunks that contain NUL bytes
            } else {
                $outtext .= $thisline . " ";
            }
        }
        $outtext = preg_replace('/[^a-zA-Z0-9\s,.\-\n\r\t@\/_()]/', "", $outtext);
        return $outtext;
    }

    private function read_docx() {
        // ZipArchive is used here; the procedural zip_open()/zip_read()
        // functions are deprecated as of PHP 8.0
        $zip = new ZipArchive;
        if ($zip->open($this->filename) !== true) return false;
        $content = $zip->getFromName("word/document.xml");
        $zip->close();
        if ($content === false) return false;
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);
        return $striped_content;
    }

    /************************excel sheet************************************/
    function xlsx_to_text($input_file) {
        $xml_filename = "xl/sharedStrings.xml"; // content file name
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            if (($xml_index = $zip_handle->locateName($xml_filename)) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() is an instance method; calling it statically fails on PHP 8
                $xml_handle = new DOMDocument();
                $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                $output_text = strip_tags($xml_handle->saveXML());
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    /*************************power point files*****************************/
    function pptx_to_text($input_file) {
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            $slide_number = 1; // loop through slide files
            while (($xml_index = $zip_handle->locateName("ppt/slides/slide" . $slide_number . ".xml")) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                $xml_handle = new DOMDocument();
                $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                $output_text .= strip_tags($xml_handle->saveXML());
                $slide_number++;
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    public function convertToText() {
        if (isset($this->filename) && !file_exists($this->filename)) {
            return "File Not exists";
        }
        $fileArray = pathinfo($this->filename);
        $file_ext = $fileArray['extension'];
        if ($file_ext == "doc") {
            return $this->read_doc();
        } elseif ($file_ext == "docx") {
            return $this->read_docx();
        } elseif ($file_ext == "xlsx") {
            // the file path must be passed explicitly to these two methods
            return $this->xlsx_to_text($this->filename);
        } elseif ($file_ext == "pptx") {
            return $this->pptx_to_text($this->filename);
        } else {
            return "Invalid File Type";
        }
    }
}
$docObj = new DocxConversion("test.docx"); // replace with your document name, with the correct doc or docx extension
echo $docText = $docObj->convertToText();
Recently I faced the task of extracting plain text from various office document formats, whether Microsoft Word documents or PDF. The task was solved, and for a somewhat wider list of input formats than planned. So with this article I open a series of publications on reading text from the following file types: DOC, DOCX, RTF, ODT and PDF, using PHP without third-party utilities.
First, let me answer a perfectly reasonable question: "Why is this needed at all?" True, the plain text extracted from, say, a Word document is a fairly jumbled mess. But this "mess" is quite sufficient for building, for example, a search index over a large repository of office documents.
Another reasonable question: "Why not use third-party utilities such as antiword or xpdf, or, as a last resort, OLE on Windows?" Those were the constraints I was given; besides, OLE is painfully slow even where it can solve the task.
Today, as a warm-up, I will cover the formats that are fairly simple for this task: Office Open XML, better known as Microsoft's DOCX, and OpenDocument Format, also known as ODT, from the ODF Alliance.
First, let's look inside a couple of files; this is literally what we see (docx in the back, odt in front):
The most important thing here is the first two characters, PK, at the start of the data. This means both files are zip archives renamed to .docx/.odt. Open one, for example with Ctrl+PageDown in Total Commander, and we see a perfectly reasonable structure (odt on the left, docx on the right):
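The PK check described here can also be done in code. The sketch below reads the first four bytes of a file and compares them with the zip local file header signature (`PK` followed by the bytes 0x03 0x04). `looksLikeZip` is a hypothetical helper name, and the path in the usage line is a placeholder.

```php
<?php
// Returns true when the file starts with the zip local file header signature.
function looksLikeZip(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false; // unreadable or missing file
    }
    $magic = fread($handle, 4);
    fclose($handle);
    return $magic === "PK\x03\x04";
}

// Usage (placeholder path):
var_dump(looksLikeZip('docs/text.docx'));
```

A positive result only says the container is a zip archive; whether it is a DOCX or an ODT still depends on the entries inside it.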
So the data files we need are content.xml in ODT and word/document.xml in DOCX. To read the text data from them, let's write some simple code:
function odt2text($filename) {
    return getTextFromZippedXML($filename, "content.xml");
}
function docx2text($filename) {
    return getTextFromZippedXML($filename, "word/document.xml");
}
function getTextFromZippedXML($archiveFile, $contentFile) {
    // Create an object for working with the zip archive...
    $zip = new ZipArchive;
    // ...and try to open the given zip file (open() returns true on
    // success and an integer error code otherwise)
    if ($zip->open($archiveFile) === true) {
        // On success, look for the data file inside the archive
        if (($index = $zip->locateName($contentFile)) !== false) {
            // If found, read it into a string
            $content = $zip->getFromIndex($index);
            // Close the zip archive, we no longer need it
            $zip->close();
            // Then load all entities and, where possible, includes of other files,
            // swallowing errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($content, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Finally, return the data without the XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }
    // If something went wrong, return an empty string
    return "";
}
Just some 30 lines, and we get the text data out of the file. The code works on PHP 5.2+ and requires php_zip.dll on Windows or the --enable-zip switch on Linux. If ZipArchive cannot be used (an old PHP version or missing libraries), the PclZip library, which reads zip files without the corresponding system facilities, will do just as well.
Note that this code is only a starting point for reading text. After the series of articles under the "Text at any cost" banner, I will try to describe the principles and implementation of reading formatted text.
Related:
- msdn.microsoft.com/en-us/library/aa338205.aspx
- www.i-rs.ru/Produkty/ODF-ISO-IEC-26300-2006/Dokumentaciya/Format-Open-Document-dlya-ofisnyh-prilozhenij-OpenDocument-v1.0.odt
- Text at any cost: PDF
- Text at any cost: RTF
- Text at any cost: WCBFF and DOC
Next time I will cover reading text from PDF without the help of xpdf, a harder task but one well within PHP's reach.





