Last updated: Fri, 25 May 2012

DOMDocument::loadHTML

(PHP 5)

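DOMDocument::loadHTML - Load HTML from a string

Description

bool DOMDocument::loadHTML ( string $source )

This function parses the HTML contained in the string source. Unlike loading XML, the HTML does not have to be well-formed to load. When this function is called statically, it creates a DOMDocument object from the loaded content. The static call may be used when no DOMDocument properties need to be set before loading.

Parameters

source

The HTML string.

Return Values

Returns TRUE on success or FALSE on failure. If called statically, returns a DOMDocument, or FALSE on failure.

Errors/Exceptions

If an empty string is passed as the source, a warning is generated. This warning is not generated by libxml and cannot be handled with libxml's error handling functions.

This method may be called statically, but doing so raises an E_STRICT error.

Malformed HTML can still be loaded, but an E_WARNING is generated when the markup is invalid. libxml's error handling functions can be used to handle these errors.

Example #1 Creating a Document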

<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>

See Also



User Contributed Notes: DOMDocument::loadHTML
Alex 10-Apr-2010 08:45
Beware of the "gotcha" (works as designed but not as expected): if you use loadHTML, you cannot validate the document. Validation is only for XML. Details here: http://bugs.php.net/bug.php?id=43771&edit=1
Shane Harter 04-Jan-2010 08:42
DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does.

This isn't well documented here. The solution is to set up a separate apparatus for dealing with just these errors.

Set libxml_use_internal_errors(true) before calling loadHTML. This will prevent errors from bubbling up to your default error handler. And you can then get at them (if you desire) using other libxml error functions.

You can find more info here http://www.php.net/manual/en/ref.libxml.php
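
A minimal sketch of the approach described in this note, assuming a deliberately malformed sample string (the markup below is a hypothetical input, not taken from the manual). It uses libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() to collect the parser's complaints instead of letting them surface as PHP warnings:

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Hypothetical malformed input: a stray closing tag and unclosed elements.
$loaded = $doc->loadHTML('<html><body><p>Unclosed paragraph<div></span></body></html>');

// Each entry is a LibXMLError object exposing level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer and restore the earlier error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
?>
```

Restoring the previous setting keeps the change local, so other code that relies on the default warning behaviour is not affected.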
mdmitry at gmail dot com 21-Dec-2009 09:02
You can also load HTML as UTF-8 using this simple hack:

<?php

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper

?>
piopier 14-Jun-2009 08:29
Here is a function I wrote to capitalize on the previous remarks about charset problems (UTF-8...) when using loadHTML and then DOM functions.
It adds the charset meta tag just after <head> to improve automatic encoding detection and converts any special character to an HTML entity, so that the PHP DOM functions/attributes return correct values.

<?php
mb_detect_order("ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15");

function loadNprepare($url, $encod = '') {
    $content = file_get_contents($url);
    if (!empty($content)) {
        if (empty($encod)) {
            $encod = mb_detect_encoding($content);
        }
        $headpos = mb_strpos($content, '<head>');
        if (FALSE === $headpos) {
            $headpos = mb_strpos($content, '<HEAD>');
        }
        if (FALSE !== $headpos) {
            // skip past the 6 characters of "<head>"
            $headpos += 6;
            $content = mb_substr($content, 0, $headpos)
                . '<meta http-equiv="Content-Type" content="text/html; charset=' . $encod . '">'
                . mb_substr($content, $headpos);
        }
        $content = mb_convert_encoding($content, 'HTML-ENTITIES', $encod);
    }
    $dom = new DomDocument;
    $res = $dom->loadHTML($content);
    if (!$res) {
        return FALSE;
    }
    return $dom;
}
?>

NB: it uses mb_strpos/mb_substr instead of mb_ereg_replace because that seemed more efficient with huge html pages.
Errol 11-Feb-2009 08:05
It should be noted that when any text is provided within the body tag
outside of a containing element, the DOMDocument will encapsulate that
text into a paragraph tag (<p>).

For example:
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Test<br></p>
<div>Text</div>
</body></html>

while:
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    "<html><body><i>Test</i><br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<i>Test</i><br><div>Text</div>
</body></html>
jamesedwardcooke+php at gmail dot com 19-Oct-2008 11:37
Using loadHTML() automagically sets the doctype property of your DOMDocument instance (to the doctype in the HTML, or to 4.0 Transitional by default). If you set the doctype with DOMImplementation, it will be overridden.

I assumed it was possible to set it and then load HTML with the doctype I defined (in order to decide the doctype at runtime), and ran into a huge headache trying to find out where my doctype was going. Hopefully this helps someone else.
xuanbn at yahoo dot com 04-Oct-2007 01:38
If you use loadHTML() to process a UTF-8 HTML string (e.g. in Vietnamese), you may get garbage text for some files while others come out fine, even if your HTML already has a meta charset tag like

  <meta http-equiv="content-type" content="text/html; charset=utf-8">

I have discovered that, for loadHTML() to process a UTF-8 file correctly, the meta tag must come first, before any UTF-8 string appears. For example, this HTML file

<html>
 <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title> Vietnamese - Tiếng Việt</title>
  </head>
<body></body>
</html>

will be OK with loadHTML() because the <meta> tag appears before the <title> tag.

But the file below will not be recognized correctly by loadHTML(), because the <title> tag, which contains a UTF-8 string, appears before the <meta> tag.

<html>
 <head>
    <title> Vietnamese - Tiếng Việt</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
<body></body>
</html>
hanhvansu at yahoo dot com 26-Apr-2007 08:50
When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, if you expect "Cạnh tranh", you will receive garbled text such as "Cáº¡nh tranh". I suggest using mb_convert_encoding before loading a UTF-8 page:
<?php
    $pageDom = new DomDocument();
    $searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8");
    // load the converted string, not the original UTF-8 input
    @$pageDom->loadHTML($searchPage);
?>
romain dot lalaut at laposte dot net 15-Feb-2007 08:31
Note that the elements of such document will have no namespace even with <html xmlns="http://www.w3.org/1999/xhtml">
bigtree at DONTSPAM dot 29a dot nl 26-Apr-2005 02:15
Pay attention when loading html that has a different charset than iso-8859-1. Since this method does not actively try to figure out what the html you are trying to load is encoded in (like most browsers do), you have to specify it in the html head. If, for instance, your html is in utf-8, make sure you have a meta tag in the html's head section:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>

If you do not specify the charset like this, all high-ascii bytes will be html-encoded. It is not enough to set the dom document you are loading the html in to UTF-8.
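
As a rough illustration of the advice in this note, the sketch below prepends a charset meta tag to a UTF-8 fragment before loading it. The helper name loadUtf8Html and the sample text are hypothetical and chosen only for this example; it assumes the input really is UTF-8 and does not already declare a different charset.

```php
<?php
// Hypothetical helper (not part of the DOM extension): declares the charset
// in the markup itself so that the parser does not fall back to iso-8859-1.
function loadUtf8Html($html)
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $doc = new DOMDocument();
    // The declaration must come before any non-ASCII text in the input.
    $doc->loadHTML($meta . $html);
    return $doc;
}

// Sample UTF-8 input, assumed for illustration only.
$doc  = loadUtf8Html('<p>Tiếng Việt</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
?>
```

Putting the declaration at the very start of the string matters for the same reason given in the notes above: text that appears before the meta tag may already have been decoded with the wrong charset.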

 