• Are there known issues with importing preformatted text?

    Importing the test file

    <html>
    <head><title>Test</title></head>
    <body>
    <pre>
    This is the first line
    This is the second line
    </pre>
    </body>
    </html>

    with <pre> listed under import settings -> content -> allowed html produces a single unbroken line:

    <pre> This is the first line This is the second line </pre>

    Is there a configuration of this plugin that will respect the line breaks within the <pre></pre> tags?

    http://wordpress.org/extend/plugins/import-html-pages/

Viewing 3 replies - 1 through 3 (of 3 total)
  • Thread Starter Mark Tuttle

    (@markrtuttle)

    I propose adding the following function to the HTML_Import class defined in html-importer.php:

    function strip_insignificant_html_whitespace($string) {
      $pre_start = "<pre(?:>|\\s[^>]*>)";
      $pre_end   = "</pre(?:>|\\s[^>]*>)";
    
      $old_parts = preg_split(";($pre_start|$pre_end);i",$string,0,PREG_SPLIT_DELIM_CAPTURE);
      $new_parts = array();
    
      $strip = true;
      foreach ($old_parts as $part) {
        if (preg_match(";$pre_start;i",$part)) {
          $tmp = preg_replace(";\s+;"," ",$part);
          $new_parts[] = preg_replace("; +>;",">",$tmp);
          $strip = false;
          continue;
        }
        if (preg_match(";$pre_end;i",$part)) {
          $tmp = preg_replace(";\s+;"," ",$part);
          $new_parts[] = preg_replace("; +>;",">",$tmp);
          $strip = true;
          continue;
        }
        if ($strip)
          $new_parts[] = preg_replace(";\s+;"," ",$part);
        else
          $new_parts[] = $part;
      }
      return implode("",$new_parts);
    }

    In clean_html

    replace
      $string = str_replace( '\n', ' ', $string );
    with
      $string = $this->strip_insignificant_html_whitespace($string);

    In get_post, inside the !empty($my_post['post_content']) check,

    replace
      $my_post['post_content'] = ereg_replace("[\n\r]", " ", $my_post['post_content']);
    with
      $my_post['post_content'] = $this->strip_insignificant_html_whitespace($my_post['post_content']);
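
    As a rough illustration of the same idea in isolation, here is a more compact standalone sketch. It is not the function above and not plugin code: the name collapse_whitespace_outside_pre is made up for this example, and it splits on whole <pre>...</pre> blocks instead of on the individual tags. It assumes <pre> blocks are properly closed and not nested.

```php
<?php
// Hypothetical standalone helper (not part of the plugin): collapse runs of
// whitespace to a single space everywhere except inside <pre>...</pre> blocks.
function collapse_whitespace_outside_pre(string $html): string {
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates between ordinary
    // text (even indexes) and captured <pre> blocks (odd indexes).
    $parts = preg_split(
        '~(<pre\b[^>]*>.*?</pre\s*>)~is',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Ordinary text: whitespace is insignificant, so collapse it.
            $parts[$i] = preg_replace('~\s+~', ' ', $part);
        }
        // Odd indexes are <pre> blocks and are left untouched.
    }

    return implode('', $parts);
}

// Cut-down version of the test file from the question.
$input  = "<body>\n<pre>\nThis is the first line\nThis is the second line\n</pre>\n</html>";
$output = collapse_whitespace_outside_pre($input);
echo $output, "\n";
```

    Run on this input, the line breaks inside <pre> survive while the newlines around the block become single spaces. An unclosed <pre> is treated as ordinary text by this sketch, so its line breaks would still be collapsed.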

    It would also be nice to strip the contents of CDATA blocks and <script>..</script> blocks cleanly. I find examples like

    <div id="googleAds">
      <!-- b e g i n   g o o g l e  a d s  -->
      <script type="text/javascript">
        //<![CDATA[
        <!--
        google_ad_client = "...";
        google_ad_slot = "...";
        google_ad_width = ...;
        google_ad_height = ...;
        //-->
        //]]>
      </script>
      <script type="text/javascript" src="/data/../pagead2.googlesyndication.com/pagead/show_ads.js">
      </script> <!-- e n d   g o o g l e  a d s  -->
    </div>

    that are not stripped cleanly when the plugin applies the PHP strip_tags function.
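
    To show the problem in isolation, here is a minimal sketch using a shortened, made-up fragment (the sample markup and values are placeholders, not taken from a real page). strip_tags removes the <script> tags themselves but leaves the JavaScript between them in the output.

```php
<?php
// Made-up sample: a paragraph followed by an inline script.
$html = '<p>Hello</p><script type="text/javascript">google_ad_slot = "x";</script>';

// strip_tags drops the tags but keeps the text between them,
// so the script body would end up in the imported post content.
$stripped = strip_tags($html);

var_dump(strpos($stripped, '<script') === false);        // the tags are gone
var_dump(strpos($stripped, 'google_ad_slot') !== false); // the script body remains
```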

    Thread Starter Mark Tuttle

    (@markrtuttle)

    To strip the CDATA, script, and style blocks, I think it is sufficient to add the functions

    function allowed_tag($tag,$allowedtags=NULL) {
      return
        !is_null($allowedtags) &&
        stripos($allowedtags,$tag) !== false;
    }
    
    function strip_cdata_block($string,$allowedtags=NULL) {
      if ($this->allowed_tag('<cdata>',$allowedtags)) return $string;
    
      $delim = "@";
      $cdata_start = preg_quote('<![CDATA[',$delim);
      $cdata_end = preg_quote(']]>',$delim);
      $block = "$cdata_start.*?$cdata_end";
    
      return preg_replace("{$delim}$block{$delim}s","",$string);
    }
    
    function strip_tag_block($tag,$string,$allowedtags=NULL) {
      if ($this->allowed_tag($tag,$allowedtags)) return $string;
      if (!preg_match(":<(.*?)>:",$tag,$match)) return $string;
    
      $delim = "@";
      $tag_str = $match[1];
      $tag_start = "<$tag_str(?:>|\\s[^>]*>)";
      $tag_end   = "</$tag_str(?:>|\\s[^>]*>)";
      $block = "$tag_start.*?$tag_end";
    
      return preg_replace("{$delim}$block{$delim}is","",$string);
    }
    
    function strip_comment_block($string) {
      $delim = "@";
      $comment_start = preg_quote('<!--',$delim);
      $comment_end = preg_quote('-->',$delim);
      $block = "$comment_start.*?$comment_end";
    
      return preg_replace("{$delim}$block{$delim}s","",$string);
    }

    and add the following calls before strip_tags at the head of clean_html:

    $string = $this->strip_cdata_block($string,$allowtags);
    $string = $this->strip_tag_block('<script>',$string,$allowtags);
    $string = $this->strip_tag_block('<style>',$string,$allowtags);
    $string = $this->strip_comment_block($string);
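
    For anyone who wants to try the effect outside the plugin, the four calls above can be approximated by a single standalone sketch. The function name strip_noncontent_blocks and the sample markup are made up for this example, and the allowed-tags check is left out, so this is an illustration of the stripping order and not a drop-in replacement.

```php
<?php
// Hypothetical standalone sketch (not plugin code): remove CDATA sections,
// <script> and <style> blocks, and HTML comments, in that order.
function strip_noncontent_blocks(string $html): string {
    $patterns = [
        '@<!\[CDATA\[.*?\]\]>@s',                              // CDATA sections
        '@<script(?:>|\s[^>]*>).*?</script(?:>|\s[^>]*>)@is',  // script blocks
        '@<style(?:>|\s[^>]*>).*?</style(?:>|\s[^>]*>)@is',    // style blocks
        '@<!--.*?-->@s',                                       // HTML comments
    ];
    // preg_replace applies an array of patterns one after another.
    return preg_replace($patterns, '', $html);
}

// Shortened, made-up variant of the ad markup shown earlier in the thread.
$html = <<<'HTML'
<div id="googleAds">
  <!-- begin ads -->
  <script type="text/javascript">
    //<![CDATA[
    google_ad_client = "...";
    //]]>
  </script>
  <p>Kept text</p>
</div>
HTML;

$clean = strip_noncontent_blocks($html);
$text  = trim(strip_tags($clean));
echo $text, "\n";
```

    With the blocks removed first, strip_tags only has ordinary markup left to deal with, and only the paragraph text remains.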

    Plugin Author Stephanie Leary

    (@sillybean)

    Thanks, Mark! I’ll try to incorporate this into the next version.

  • The topic ‘[Plugin: HTML Import 2] Importing preformatted text (pre tag)’ is closed to new replies.