Using DOMDocument to parse non-UTF-8 web pages in PHP

Recently I have been digging into PHP with curl and DOMDocument. There are quite a lot of impressive facilities, such as DOMXPath, curl POST requests and cookie handling, that make it nearly effortless to simulate actions on a website without relying on JavaScript. Here is a problem, and a trick to work around it, that I found while handling non-UTF-8 pages with curl and DOMDocument.

Case 1:
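Parsing a non-UTF-8 page into DOMDocument. Some web pages put the <title> tag before the Content-Type <meta> tag, in the following sequence:

    <title>...</title>
    <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

Assume you have just received the HTML content from curl_exec: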

    //....
    $htmlContent = curl_exec($ch);
    // Create a new DOMDocument object; 'ENCODING' stands for the page's charset
    $doc = new DOMDocument('1.0', 'ENCODING');
    // You will probably get a warning here
    $doc->loadHTML($htmlContent);
    $doc->save('test.html');

Open test.html in any text editor, and you may find that the HTML body is gone and the header is incomplete.
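To see exactly which warnings loadHTML raises, libxml's internal error buffer can be used instead of letting PHP print them. The following is only a minimal sketch: the sample markup stands in for the curl_exec result and is an assumption for illustration, and the messages you get depend on your libxml version.

```php
<?php
// Sample markup standing in for the curl_exec() result (an assumption for illustration).
$htmlContent = '<html><head><title>Test</title>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    . '</head><body><p>Hello</p></body></html>';

// Collect libxml warnings in a buffer instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlContent);

// Fetch whatever the parser complained about, then reset the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Each entry is a LibXMLError object with a message and a line number.
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

Logging these messages for a real page makes it easier to tell an encoding problem from ordinary malformed markup.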

To resolve this problem, you have to put the title after the Content-Type <meta> tag.

Here is a simple trick to do that:

    $htmlContent = curl_exec($ch);
    // Capture the <title> element ($1) and the Content-Type <meta> tag that follows it ($2).
    // '#' is used as the delimiter because the pattern itself contains '/'.
    $pattern = '#(<title>.*?</title>)\s*(<meta\s+http-equiv="Content-Type"\s+content="text/html;\s*charset=gb2312"\s*/?>)#is';
    // Swap the two tags so the charset declaration comes first
    $htmlContent = preg_replace($pattern, "$2\r\n$1", $htmlContent);
    $doc = new DOMDocument('1.0', 'ENCODING');
    $doc->loadHTML($htmlContent);

Now you should obtain the proper document content without losing anything.
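The pattern above is tied to charset=gb2312. As a rough sketch of how the same swap could be generalized to any declared charset, the hypothetical helper below (the function name and the looser pattern are my own assumptions, not part of any library) moves a Content-Type <meta> tag that directly follows the <title> element in front of it.

```php
<?php
/**
 * Hypothetical helper: if a Content-Type <meta> tag directly follows the
 * <title> element, move it in front of the title. Other input is returned as-is.
 */
function moveCharsetMetaBeforeTitle(string $html): string
{
    // $1 = the whole <title> element, $2 = the Content-Type <meta> tag after it.
    $pattern = '#(<title\b[^>]*>.*?</title>)\s*'
        . '(<meta\s+http-equiv=["\']?Content-Type["\']?[^>]*>)#is';

    // Swap the first occurrence only; preg_replace returns null on error.
    $result = preg_replace($pattern, "$2\r\n$1", $html, 1);

    return $result === null ? $html : $result;
}

// Example with a small hand-written page (sample data, not a real site).
$sample = "<html><head><title>Test</title>\n"
    . '<meta http-equiv="Content-Type" content="text/html; charset=big5" />'
    . '</head><body><p>Hello</p></body></html>';

$fixed = moveCharsetMetaBeforeTitle($sample);
echo $fixed, "\n";
```

The result can then be passed to loadHTML in the same way as before. Like the original trick, this only handles a <meta> tag that comes immediately after the title; pages that declare their charset elsewhere or not at all would need a different approach.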