The DOMDocument class

(PHP 5, PHP 7, PHP 8)

Introduction

Represents an entire HTML or XML document; serves as the root of the document tree.

Class synopsis

class DOMDocument extends DOMNode implements DOMParentNode {
/* Properties */
public readonly ?DOMDocumentType $doctype;
public readonly ?DOMElement $documentElement;
public readonly ?string $actualEncoding;
public ?string $encoding;
public readonly ?string $xmlEncoding;
public ?string $version;
public readonly mixed $config;
public bool $recover;
public readonly ?DOMElement $firstElementChild;
public readonly ?DOMElement $lastElementChild;
public readonly int $childElementCount;
/* Inherited properties */
public readonly string $nodeName;
public readonly int $nodeType;
public readonly ?DOMNode $parentNode;
public readonly ?DOMElement $parentElement;
public readonly DOMNodeList $childNodes;
public readonly ?DOMNode $firstChild;
public readonly ?DOMNode $lastChild;
public readonly ?DOMNode $previousSibling;
public readonly ?DOMNode $nextSibling;
public readonly ?DOMNamedNodeMap $attributes;
public readonly bool $isConnected;
public readonly ?DOMDocument $ownerDocument;
public readonly ?string $namespaceURI;
public string $prefix;
public readonly ?string $localName;
public readonly ?string $baseURI;
/* Methods */
public __construct(string $version = "1.0", string $encoding = "")
public append(DOMNode|string ...$nodes): void
public createAttribute(string $localName): DOMAttr|false
public createAttributeNS(?string $namespace, string $qualifiedName): DOMAttr|false
public createElement(string $localName, string $value = ""): DOMElement|false
public createElementNS(?string $namespace, string $qualifiedName, string $value = ""): DOMElement|false
public getElementById(string $elementId): ?DOMElement
public getElementsByTagName(string $qualifiedName): DOMNodeList
public getElementsByTagNameNS(?string $namespace, string $localName): DOMNodeList
public importNode(DOMNode $node, bool $deep = false): DOMNode|false
public load(string $filename, int $options = 0): bool
public loadHTML(string $source, int $options = 0): bool
public loadHTMLFile(string $filename, int $options = 0): bool
public loadXML(string $source, int $options = 0): bool
public prepend(DOMNode|string ...$nodes): void
public registerNodeClass(string $baseClass, ?string $extendedClass): bool
public relaxNGValidate(string $filename): bool
public replaceChildren(DOMNode|string ...$nodes): void
public save(string $filename, int $options = 0): int|false
public saveHTML(?DOMNode $node = null): string|false
public saveHTMLFile(string $filename): int|false
public saveXML(?DOMNode $node = null, int $options = 0): string|false
public schemaValidate(string $filename, int $flags = 0): bool
public schemaValidateSource(string $source, int $flags = 0): bool
public validate(): bool
public xinclude(int $options = 0): int|false
/* Inherited methods */
public DOMNode::C14N(
    bool $exclusive = false,
    bool $withComments = false,
    ?array $xpath = null,
    ?array $nsPrefixes = null
): string|false
public DOMNode::C14NFile(
    string $uri,
    bool $exclusive = false,
    bool $withComments = false,
    ?array $xpath = null,
    ?array $nsPrefixes = null
): int|false
public DOMNode::isEqualNode(?DOMNode $otherNode): bool
public DOMNode::isSameNode(DOMNode $otherNode): bool
public DOMNode::isSupported(string $feature, string $version): bool
}

Properties

actualEncoding

Deprecated. The actual encoding of the document; a read-only equivalent of encoding.

childElementCount

The number of child elements.

config

Deprecated. Configuration used when DOMDocument::normalizeDocument() is invoked.

doctype

The Document Type Declaration associated with this document.

documentElement

The DOMElement object that is the first document element. If not found, this evaluates to null.

documentURI

The location of the document or null if undefined.

encoding

Encoding of the document, as specified by the XML declaration. This attribute is not present in the final DOM Level 3 specification, but is the only way of manipulating XML document encoding in this implementation.

firstElementChild

First child element or null.

formatOutput

Nicely formats output with indentation and extra space. This has no effect if the document was loaded with preserveWhiteSpace enabled.

implementation

The DOMImplementation object that handles this document.

lastElementChild

Last child element or null.

preserveWhiteSpace

Do not remove redundant white space. Defaults to true. Setting this to false has the same effect as passing LIBXML_NOBLANKS as an option to DOMDocument::load() etc.

recover

Proprietary. Enables recovery mode, i.e. trying to parse non-well formed documents. This attribute is not part of the DOM specification and is specific to libxml.

resolveExternals

Set it to true to load external entities from a doctype declaration. This is useful for including character entities in your XML document.

standalone

Deprecated. Whether or not the document is standalone, as specified by the XML declaration, corresponds to xmlStandalone.

strictErrorChecking

Throws DOMException on errors. Defaults to true.

substituteEntities

Proprietary. Whether or not to substitute entities. This attribute is not part of the DOM specification and is specific to libxml. Defaults to false.

Warning

Enabling entity substitution may facilitate XML External Entity (XXE) attacks.

validateOnParse

Loads and validates against the DTD. Defaults to false.

Warning

Enabling DTD validation may facilitate XML External Entity (XXE) attacks.

version

Deprecated. Version of XML, corresponds to xmlVersion.

xmlEncoding

An attribute specifying, as part of the XML declaration, the encoding of this document. This is null when unspecified or when it is not known, such as when the Document was created in memory.

xmlStandalone

An attribute specifying, as part of the XML declaration, whether this document is standalone. This is false when unspecified. A standalone document is one where there are no external markup declarations. An example of such a markup declaration is when the DTD declares an attribute with a default value.

xmlVersion

An attribute specifying, as part of the XML declaration, the version number of this document. If there is no declaration and if this document supports the "XML" feature, the value is "1.0".
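
As a rough illustration of how several of the properties above behave, the following sketch loads a small inline document and inspects it. The XML string and all variable names are made up for the example, and childElementCount assumes PHP 8.0 or later.

```php
<?php
// Hypothetical sample document, used only for illustration.
$source = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        . '<library><book>Dune</book><book>Emma</book></library>';

$doc = new DOMDocument();

// An empty document has no root element yet.
$emptyRoot = $doc->documentElement;              // null

// Set the formatting-related properties before loading.
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML($source);

// Values taken from the XML declaration.
$version    = $doc->xmlVersion;                  // "1.0"
$encoding   = $doc->xmlEncoding;                 // "UTF-8"
$standalone = $doc->xmlStandalone;               // true

// The root element, and the number of element children of the document node.
$rootName   = $doc->documentElement->nodeName;   // "library"
$childCount = $doc->childElementCount;           // 1

// With formatOutput enabled, each <book> is written on its own indented line.
$formatted = $doc->saveXML();
echo $formatted;
```

Setting preserveWhiteSpace and formatOutput before the load call is a deliberate choice here, since white space handling is decided while the document is parsed.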

Changelog

Version Description
8.0.0 DOMDocument implements DOMParentNode now.
8.0.0 The unimplemented method DOMDocument::renameNode() has been removed.
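
Because DOMDocument implements DOMParentNode as of PHP 8.0.0, methods such as append() and prepend() from the class synopsis can be called on the document itself. A minimal sketch, with made-up element names and comment text:

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');

// append() adds nodes after the last child of the document.
$doc->append($doc->createElement('catalog'));

// prepend() adds nodes before the first child; here, a comment ahead of the root.
$doc->prepend($doc->createComment(' example only '));

$firstIsComment = $doc->firstChild instanceof DOMComment;   // true
$rootName       = $doc->firstElementChild->nodeName;        // "catalog"
$elementCount   = $doc->childElementCount;                  // 1

echo $doc->saveXML();
```

Only node arguments are passed here: a string argument would become a text node, which is not a valid child of a document.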

Notes

Note:

The DOM extension uses UTF-8 encoding. Use mb_convert_encoding(), UConverter::transcode(), or iconv() to handle other encodings.
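
As a sketch of what this looks like in practice, the snippet below converts a Latin-1 string to UTF-8 before loading it. The input bytes are a hypothetical example, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical input: an XML fragment whose bytes are ISO-8859-1 (Latin-1),
// with no encoding declaration. "\xE9" is "é" in Latin-1.
$latin1 = "<name>Caf\xE9</name>";

// Convert to UTF-8 before handing the string to the DOM extension.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$doc = new DOMDocument();
$doc->loadXML($utf8);

// Text read back from the DOM is UTF-8 ("é" is the two bytes C3 A9).
$name = $doc->documentElement->textContent;
echo bin2hex($name), "\n";
```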

Note:

When using json_encode() on a DOMDocument object the result will be that of encoding an empty object.
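
Two possible ways around this are sketched below: serialize the XML and embed it as a JSON string, or copy the values you need into a plain array first. The sample document and variable names are invented for the example.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<user><name>Ada</name><role>admin</role></user>');

// Encoding the object directly loses the content, as described in the note above.
$direct = json_encode($doc);

// Option 1: embed the serialized XML as a JSON string.
$asString = json_encode(['xml' => $doc->saveXML($doc->documentElement)]);

// Option 2: copy the values you need into a plain array first.
$data = [];
foreach ($doc->documentElement->childNodes as $child) {
    if ($child instanceof DOMElement) {
        $data[$child->nodeName] = $child->textContent;
    }
}
$asArray = json_encode($data);
echo $asArray, "\n";
```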


User Contributed Notes 18 notes

up
113
Fernando H
16 years ago
Here is a quick example of how to use this class, so that new users can get started without having to figure it all out by themselves. (At the time of posting, this documentation had just been added and lacked examples.)

<?php

// Set the content type to be XML, so that the browser will recognise it as XML.
header( "content-type: application/xml; charset=ISO-8859-15" );

// "Create" the document.
$xml = new DOMDocument( "1.0", "ISO-8859-15" );

// Create some elements.
$xml_album = $xml->createElement( "Album" );
$xml_track = $xml->createElement( "Track", "The ninth symphony" );

// Set the attributes.
$xml_track->setAttribute( "length", "0:01:15" );
$xml_track->setAttribute( "bitrate", "64kb/s" );
$xml_track->setAttribute( "channels", "2" );

// Create another element, just to show you can add any (reasonable) number of sublevels.
$xml_note = $xml->createElement( "Note", "The last symphony composed by Ludwig van Beethoven." );

// Append the whole bunch.
$xml_track->appendChild( $xml_note );
$xml_album->appendChild( $xml_track );

// Repeat the above with some different values..
$xml_track = $xml->createElement( "Track", "Highway Blues" );

$xml_track->setAttribute( "length", "0:01:33" );
$xml_track->setAttribute( "bitrate", "64kb/s" );
$xml_track->setAttribute( "channels", "2" );
$xml_album->appendChild( $xml_track );

$xml->appendChild( $xml_album );

// Serialize the document to XML and print it.
print $xml->saveXML();

?>

Output:
<Album>
<Track length="0:01:15" bitrate="64kb/s" channels="2">
The ninth symphony
<Note>
The last symphony composed by Ludwig van Beethoven.
</Note>
</Track>
<Track length="0:01:33" bitrate="64kb/s" channels="2">Highway Blues</Track>
</Album>

If you want your PHP->DOM code to run under the .xml extension, you should set your webserver up to run the .xml extension with PHP (refer to the installation/configuration documentation for PHP on how to do this).

Note that this:
<?php
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = $xml->createElement( "Album" );
$xml_track = $xml->createElement( "Track" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
?>

is NOT the same as this:
<?php
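// Will NOT work: a DOMElement created with "new" is read-only until it is
// associated with a document, so $xml_album->appendChild() fails.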
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = new DOMElement( "Album" );
$xml_track = new DOMElement( "Track" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
?>

although this will work:
<?php
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = new DOMElement( "Album" );
$xml->appendChild( $xml_album );
?>
up
5
andreas at userbrain dot com
2 years ago
After struggling with parsing and modifying partial HTML content for several hours, I came to this solution which does work for me and is relatively simple compared to what else I found online.

This solution fixes unwanted DOCTYPE and html, body tags as well as encoding issues.

<?php

// Assumption: content is utf-8 encoded
$content = "<h1>This is a heading</h1><p>This is a paragraph</p>";

// Load content to a div and specify encoding with a meta tag
$temp_dom = new DOMDocument();
$temp_dom->loadHTML("<meta http-equiv='Content-Type' content='charset=utf-8' /><div>$content</div>");

// As loadHTML() adds a DOCTYPE as well as <html> and <body> tag, let’s create another DOMDocument and import just the nodes we want
$dom = new DOMDocument();
$first_div = $temp_dom->getElementsByTagName('div')[0];
$first_div_node = $dom->importNode($first_div, true);
$dom->appendChild($first_div_node);

// Do whatever you want to do
$dom->getElementsByTagName('h1')[0]->setAttribute('class', 'happy');

// You could also just echo $dom->saveHtml() if you don’t mind the div and whitespace
echo substr(trim($dom->saveHtml()), 5, -6);

// Outputs: <h1 class="happy">This is a heading</h1><p>This is a paragraph</p>
?>
up
23
developer at nabtron dot com
8 years ago
For those landing here while looking into encoding issues with UTF-8 characters: it's pretty easy to correct, without adding any additional output tag to your HTML.

We'll be utilizing: mb_convert_encoding

Thanks to the user who shared: SmartDOMDocument in previous comments, I got the idea of solving it. However I truly wish that he shared the method instead of giving a link.

Anyway coming back to the solution, you can simply use:

<?php

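// Check that the content we're receiving isn't empty, to avoid the warning.
if ( empty( $content ) ) {
    return false;
}

// Convert non-ASCII characters to HTML entities so the parser does not garble them.
// Note: the 'HTML-ENTITIES' target encoding is deprecated as of PHP 8.2.
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');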

// creating new document
$doc = new DOMDocument('1.0', 'utf-8');

//turning off some errors
libxml_use_internal_errors(true);

// it loads the content without adding enclosing html/body tags and also the doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// do whatever you want to do with this code now

?>

I hope it solves the issue for someone! If you need my help or service to fix your code, you can reach me on nabtron.com or contact me at the email mentioned with this comment.
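
The effect of the two flags can be checked on a small fragment. This is a minimal sketch, assuming the fragment has a single wrapping element (with several top-level elements, LIBXML_HTML_NOIMPLIED can produce unexpected nesting); the fragment and variable names are hypothetical.

```php
<?php
// Hypothetical fragment with one wrapping element.
$fragment = '<div><p>Hello</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$html = $doc->saveHTML();

// Neither a doctype nor an <html> wrapper should have been added.
$hasDoctype = stripos($html, '<!DOCTYPE') !== false;
$hasHtmlTag = stripos($html, '<html') !== false;
var_dump($hasDoctype, $hasHtmlTag);
```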
up
23
jay at jaygilford dot com
14 years ago
Here's a small function I wrote to get all page links using the DOMDocument which will hopefully be of use to others

<?php
/**
* @author Jay Gilford
*/

/**
* get_links()
*
* @param string $url
* @return array
*/
function get_links($url) {

// Create a new DOM Document to hold our webpage structure
$xml = new DOMDocument();

// Load the url's contents into the DOM
$xml->loadHTMLFile($url);

// Empty array to hold all links to return
$links = array();

//Loop through each <a> tag in the dom and add it to the link array
foreach($xml->getElementsByTagName('a') as $link) {
$links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
}

//Return the links
return $links;
}
?>
up
13
tloach at gmail dot com
14 years ago
For anyone else who has been having issues with formatOutput not working, here is a workaround:

rather than just doing something like:

<?php
$outXML = $xml->saveXML();
?>

force it to reload the XML from scratch, then it will format correctly:

<?php
$outXML = $xml->saveXML();
$xml = new DOMDocument();
$xml->preserveWhiteSpace = false;
$xml->formatOutput = true;
$xml->loadXML($outXML);
$outXML = $xml->saveXML();
?>
up
3
biker dot mike at gmx dot com
8 years ago
Look out for the following gotcha when loading XML from a string:

<?php
$doc = new \DOMDocument;
$doc->documentURI = $myXmlFilename;
$doc->loadXML($myXmlString);
?>

documentURI is now set to the value of $myXmlFilename, right?

Wrong!

It's set to the current working directory. If you want to manually set documentURI to something other than the CWD, do so AFTER the call to loadXML().

E.g.:
<?php
$doc = new \DOMDocument;
$doc->loadXML($myXmlString);
$doc->documentURI = $myXmlFilename;
?>

documentURI really is now set to the value of $myXmlFilename.
up
6
Nick M
13 years ago
You may need to save all or part of a DOMDocument as an XHTML-friendly string, something compliant with both XML and HTML 4. Here's the DOMDocument class extended with a saveXHTML method:

<?php

/**
* XHTML Document
*
* Represents an entire XHTML DOM document; serves as the root of the document tree.
*/
class XHTMLDocument extends DOMDocument {

/**
* These tags must always self-terminate. Anything else must never self-terminate.
*
* @var array
*/
public $selfTerminate = array(
'area','base','basefont','br','col','frame','hr','img','input','link','meta','param'
);

/**
* saveXHTML
*
* Dumps the internal XML tree back into an XHTML-friendly string.
*
* @param DOMNode $node
* Use this parameter to output only a specific node rather than the entire document.
*/
public function saveXHTML(?DOMNode $node = null) {

    // Default to the first child of the document when no node is given.
    if (!$node) $node = $this->firstChild;

    $doc = new DOMDocument('1.0');
    $clone = $doc->importNode($node->cloneNode(false), true);
    $term = in_array(strtolower($clone->nodeName), $this->selfTerminate);
    $inner = '';

    if (!$term) {
        // An empty text node forces an explicit closing tag.
        $clone->appendChild(new DOMText(''));
        if ($node->childNodes) foreach ($node->childNodes as $child) {
            $inner .= $this->saveXHTML($child);
        }
    }

    $doc->appendChild($clone);
    $out = $doc->saveXML($clone);

    return $term ? substr($out, 0, -2) . ' />' : str_replace('><', ">$inner<", $out);

}

}

?>

This hasn't been benchmarked, but is probably significantly slower than saveXML or saveHTML and should be used sparingly.
up
1
pastormontesinos at gmail dot com
3 years ago
To handle script nodes safely when parsing, the best option is to extend DOMDocument: set the script tags aside while DOMDocument processes the markup, then put them back right after saveHTML() is called. Here is my custom class.

<?php

class SafeDOMDocument extends \DOMDocument
{
    const REGEX_JS = '#(\s*<!--(\[if[^\n]*>)?\s*(<script.*</script>)+\s*(<!\[endif\])?-->)|(\s*<script.*</script>)#isU';
    const SUBSTITUTION_FORMAT = '<!--<script class="script_%s"></script>-->';
    private $matchedScripts = [];

    public function loadHTML($source, $options = 0)
    {
        $this->formatOutput = false;
        $this->preserveWhiteSpace = true;
        $this->validateOnParse = false;
        $this->strictErrorChecking = false;
        $this->recover = false;
        $this->resolveExternals = false;
        $this->substituteEntities = false;
        $matches = [];
        $success = preg_match_all(self::REGEX_JS, $source, $matches);

        if ($success && !empty($matches)) {
            foreach ($matches[0] as $match) {
                $storedScript = rtrim(ltrim($match, "\n\r\t "), "\n\r\t ");
                $scriptId = md5($storedScript);
                $key = sprintf(self::SUBSTITUTION_FORMAT, $scriptId);
                $source = str_replace($match, $key, $source);
                $this->matchedScripts[$key] = $storedScript;
            }
        }

        return parent::loadHTML($source, $options);
    }

    public function saveHTML(DOMNode $node = null)
    {
        $output = parent::saveHTML($node);

        if (count($this->matchedScripts)) {
            foreach ($this->matchedScripts as $substitution => $originalSnippet) {
                $output = str_replace($substitution, $originalSnippet, $output);
            }
        }

        return $output;
    }
}
?>
up
7
evert at er dot nl
13 years ago
A nice and simple node 2 array I wrote, worth a try ;)

<?php
function getArray($node)
{
    $array = false;

    if ($node->hasAttributes())
    {
        foreach ($node->attributes as $attr)
        {
            $array[$attr->nodeName] = $attr->nodeValue;
        }
    }

    if ($node->hasChildNodes())
    {
        if ($node->childNodes->length == 1)
        {
            $array[$node->firstChild->nodeName] = $node->firstChild->nodeValue;
        }
        else
        {
            foreach ($node->childNodes as $childNode)
            {
                if ($childNode->nodeType != XML_TEXT_NODE)
                {
                    // This is a plain function, so recurse directly rather than via $this.
                    $array[$childNode->nodeName][] = getArray($childNode);
                }
            }
        }
    }

    return $array;
}
?>
up
4
fcartegnie
14 years ago
Be careful with formatOutput.

Creating an empty node like this:
createElement('foo','')
instead of
createElement('foo')
will break formatOutput.
up
1
cmyk777 at gmail dot com
15 years ago
This function may help to debug current dom element:

<?php
function dom_dump($obj) {
    if ($classname = get_class($obj)) {
        $retval = "Instance of $classname, node list: \n";
        switch (true) {
            case ($obj instanceof DOMDocument):
                $retval .= "XPath: {$obj->getNodePath()}\n".$obj->saveXML($obj);
                break;
            case ($obj instanceof DOMElement):
                $retval .= "XPath: {$obj->getNodePath()}\n".$obj->ownerDocument->saveXML($obj);
                break;
            case ($obj instanceof DOMAttr):
                $retval .= "XPath: {$obj->getNodePath()}\n".$obj->ownerDocument->saveXML($obj);
                //$retval .= $obj->ownerDocument->saveXML($obj);
                break;
            case ($obj instanceof DOMNodeList):
                for ($i = 0; $i < $obj->length; $i++) {
                    $retval .= "Item #$i, XPath: {$obj->item($i)->getNodePath()}\n".
                        "{$obj->item($i)->ownerDocument->saveXML($obj->item($i))}\n";
                }
                break;
            default:
                return "Instance of unknown class";
        }
    } else {
        return 'no elements...';
    }
    return htmlspecialchars($retval);
}
?>

Example usage:

<?php
$dom = new DomDocument();
$dom->load('test.xml');
$body = $dom->documentElement->getElementsByTagName('book');
echo '<pre>'.dom_dump($body).'</pre>';
?>

Output:

Instance of DOMNodeList, node list:
Item #0, XPath: /library/book[1]
<book isbn="0345342968">
<title>Fahrenheit 451</title>
<author>R. Bradbury</author>
<publisher>Del Rey</publisher>
</book>
Item #1, XPath: /library/book[2]
<book isbn="0048231398">
<title>The Silmarillion</title>
<author>J.R.R. Tolkien</author>
<publisher>G. Allen &amp; Unwin</publisher>
</book>
Item #2, XPath: /library/book[3]
<book isbn="0451524934">
<title>1984</title>
<author>G. Orwell</author>
<publisher>Signet</publisher>
</book>
Item #3, XPath: /library/book[4]
<book isbn="031219126X">
<title>Frankenstein</title>
<author>M. Shelley</author>
<publisher>Bedford</publisher>
</book>
Item #4, XPath: /library/book[5]
<book isbn="0312863551">
<title>The Moon Is a Harsh Mistress</title>
<author>R. A. Heinlein</author>
<publisher>Orb</publisher>
</book>
up
0
610010559 at qq dot com
2 years ago
When you add a new element to formatted XML data with appendChild(), you will find that the new element is not formatted (no indentation, no line break). My solution, in short, is to load the XML without preserving white space. Example:
<?php
$doc = new \DOMDocument();
$doc->formatOutput = true;
$doc->preserveWhiteSpace = false; // this is the key; the default value is true
$doc->loadXML($xmlStr);
// Append inside the root element; a document can only have one root.
$doc->documentElement->appendChild($doc->createElement('php', '666'));
$formattedXMLStr = $doc->saveXML(); // DOMDocument formats the XML string for you
echo $formattedXMLStr;
?>
It took me some time to work this out. I hope it saves you some.
up
1
sites.sitesbr.net
11 years ago
How to objectify a DomDocument with a hierarchy like:
<root>
<item>
<prop1>info1</prop1>
<prop2>info2</prop2>
<prop3>info3</prop3>
</item>
<item>
<prop1>info1</prop1>
<prop2>info2</prop2>
<prop3>info3</prop3>
</item>
</root>

The information can then be retrieved in object style, for example:

<?php
$theNodeValue = $aitem->prop1;
?>

Here is the code: one Class and 2 functions.

<?php
class ArrayNode{
    public $nodeName, $nodeValue;
}

function getChildNodeElements( $domNode ){
    $nodes = array();
    for($i=0; $i < $domNode->childNodes->length; $i++){
        $cn = $domNode->childNodes->item($i);
        // nodeType 1 is XML_ELEMENT_NODE
        if($cn->nodeType == 1){
            $nodes[] = $cn;
        }
    }
    return $nodes;
}

function getArrayNodes( $domDoc ){
    $res = array();

    for($i=0; $i < $domDoc->childNodes->length; $i++){
        $cn = $domDoc->childNodes->item($i);
        # The first is the root tag...
        if( $cn->nodeType == 1){
            # But we want its childNodes.
            $sub_cn = getChildNodeElements( $cn);
            # Found the tagName:
            $baseItemTagName = $sub_cn[0]->nodeName;
            break;
        }
    }

    $dnl = $domDoc->getElementsByTagName( $baseItemTagName);

    for($i=0; $i< $dnl->length; $i++){
        $arrayNode = new ArrayNode();

        # Summary
        $arrayNode->nodeName = $dnl->item($i)->nodeName;
        $arrayNode->nodeValue = $dnl->item($i)->nodeValue;

        # Child Nodes
        $cn = $dnl->item($i)->childNodes;
        for($k=0; $k<$cn->length; $k++){
            if($cn->item($k)->nodeName == "#text" && trim($cn->item($k)->nodeValue) == "") continue;
            $arrayNode->{$cn->item($k)->nodeName} = $cn->item($k)->nodeValue;
        }

        # Attributes
        $attr = $dnl->item($i)->attributes;
        for($k=0; $k < $attr->length; $k++){
            if(!is_null($attr)){
                if($attr->item($k)->nodeName == "#text" && trim($attr->item($k)->nodeValue) == "") continue;
                $arrayNode->{$attr->item($k)->nodeName} = $attr->item($k)->nodeValue;
            }
        }

        $res[] = $arrayNode;

    }

    return $res;
}
?>

To use it:

<?php

# First you load a XML in a DomDocument variable.

$url = "/path/to/yourxmlfile.xml";
$domSrc = file_get_contents($url);
$dom = new DomDocument();
$dom->loadXML( $domSrc );

# Then, you get the ArrayNodes from the DomDocument.

$ans = getArrayNodes( $dom );


for($i=0; $i < count( $ans ) ; $i++){

$cn = $ans[ $i];

$info1 = $cn->prop1;
$info2 = $cn->prop2;
$info3 = $cn->prop3;

// ...

}

?>
up
0
ashjkshdu283 at gmail dot com
6 years ago
/* Function evolved from jay at jaygilford dot com post
* This function will return an array of the values of the specified
* attribute ($attr) for all the Dom Document object's elements
*/

<?php

function getAttrData(string $attr, DomDocument $dom) {
// Empty array to hold all classes to return
$attrData = array();

// Loop through each tag in the DOM and add its attribute data to the array
foreach($dom->getElementsByTagName('*') as $tag) {
    if(empty($tag->getAttribute($attr)) === false) {
        array_push($attrData, $tag->getAttribute($attr));
    }
}

//Return the array of attribute data
return array_unique($attrData);
}

$html = '
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<a href="#someLink" id="someLink" class="link-class">Some Link</a>
<a href="#someOtherLink" id="someOtherLink" class="link-class">Some Other Link</a>
<h1 id="header1" class="header-class">My First Heading</h1>
<p id="para1" class="para-class">My first paragraph.</p>
</body>
</html>';
$dom = new DOMDocument();
$dom->loadHtml($html);
$dom->saveHTML();
var_dump(getAttrData('class', $dom));
?>
up
0
ingjetel at gmail dot com
9 years ago
Easy function for basic output of XML file via DOM parsing

<?php
$dom = new DomDocument();
$dom->load("./file.xml") or die("error");
$start = $dom->documentElement;
fc($start);

function fc($node) {
    $child = $node->childNodes;
    foreach ($child as $item) {
        if ($item->nodeType == XML_TEXT_NODE) {
            if (strlen(trim($item->nodeValue))) echo trim($item->nodeValue)."<br/>";
        }
        else if ($item->nodeType == XML_ELEMENT_NODE) fc($item);
    }
}
?>
up
-1
danny dot nunez15 at gmail dot com
10 years ago
A simple function to grab all links in a page.

<?php
function get_links($url) {

    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the url's contents into the DOM
    $xml->loadHTMLFile($url);

    // Empty array to hold all links to return
    $links = array();

    // Loop through each <a> tag in the DOM and add it to the link array
    foreach ($xml->getElementsByTagName('a') as $link) {
        $url = $link->getAttribute('href');
        if (!empty($url)) {
            $links[] = $link->getAttribute('href');
        }
    }

    // Return the links
    return $links;
}
?>
up
-2
admin at beerpla dot net
14 years ago
After seeing many complaints about certain DOMDocument shortcomings, such as bad handling of encodings and always saving HTML fragments with <html>, <head>, and DOCTYPE, I decided that a better solution is needed.

So here it is: SmartDOMDocument. You can find it at http://beerpla.net/projects/smartdomdocument/

Currently, the main highlights are:

- SmartDOMDocument inherits from DOMDocument, so it's very easy to use - just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

- saveHTMLExact() - DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).
Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).
SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want - it saves HTML without adding that extra garbage that DOMDocument does.

- encoding fix - DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output.
SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you - just use loadHTML() as you would normally.

- SmartDOMDocument Object As String - you can use a SmartDOMDocument object as a string which will print out its contents.
For example:
<?php
echo "Here is the HTML: $smart_dom_doc";
?>

I'm going to maintain this code and try to fix bugs as they come in.

Enjoy.
up
-5
qrworld.net
9 years ago
In this post http://softontherocks.blogspot.com/2014/11/descargar-el-contenido-de-una-url_11.html I found a simple way to get the content of a URL with DOMDocument, loadHTMLFile and saveHTML().

<?php
function getURLContent($url){
    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    @$doc->loadHTMLFile($url);
    return $doc->saveHTML();
}
?>