• Hi,
    I’m trying to update some import-atom code I found on the Internet so that it can import Atom files from an AOL Journals blog. It now does everything correctly, except that the special characters are not coming over cleanly. As you can see in the script, I started trying to add slashes, but the number of special characters used is large (I think the original texts were written in Word and pasted into the blog).

    The feed to be imported begins with
    <?xml version="1.0" encoding="utf-8" ?>
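
    The declaration only states what the feed claims to be, so it may be worth checking whether the bytes really are valid UTF-8 before blaming the database step. A minimal sketch, assuming (this is a guess, nothing above confirms it) that any invalid input is Windows-1252 text pasted from Word; ensure_utf8() is a hypothetical helper name, not part of the script below:

    ```php
    <?php
    // Hypothetical helper: return $text as UTF-8.
    // ASSUMPTION: input that is not valid UTF-8 is Windows-1252.
    function ensure_utf8($text) {
        // With the /u modifier, preg_match() only returns 1 when the
        // subject is well-formed UTF-8.
        if (preg_match('//u', $text) === 1) {
            return $text; // already valid UTF-8, leave untouched
        }
        // Convert from the assumed source encoding; keep the original
        // string if the conversion fails.
        $converted = @iconv('Windows-1252', 'UTF-8', $text);
        return ($converted === false) ? $text : $converted;
    }

    // "\x92" is the Windows-1252 byte for a typographic apostrophe.
    echo bin2hex(ensure_utf8("can\x92t")), "\n";
    ```

    If something like this were used, it would be applied to the file contents once, right after they are read and before any of the regular expressions run.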

    This is the first MySQL and PHP I have ever done, so now I’m really at a loss. Hope someone can help.
    HB

    <?php
    // based on import-rss.php
    // all changes gpl licensed and 2005 (C) havard@dahle.no
    
    // Include ezSQL core
    include_once "../../../ezsql/shared/ez_sql_core.php";
    
    // Include ezSQL database specific component (in this case mySQL)
    include_once "../../../ezsql/mysql/ez_sql_mysql.php";
    
    // Initialise database object and establish a connection
    // at the same time - db_user / db_password / db_name / db_host
    $wpdb = new ezSQL_mysql('wordpress','blog77','wordpress','localhost');
    
    // Example:
    define('ATOMFILE', '/home/simon/atom.xml');
    // or if it's in the same directory as aolatom.php
    // define('ATOMFILE', 'atom.xml');
    // or somewhere online (NOTE: This requires 'allow_url_fopen=1' in php.ini!)
    // see http://www.php.net/manual/en/ref.filesystem.php#ini.allow-url-fopen
    // define('ATOMFILE', 'http://myfunkyblog.blogspot.com/atom.xml');
    
    $timezone_offset = 2; // GMT offset of the posts you're importing
    
    function unhtmlentities($string) { // From php.net for < 4.3 compat
       $trans_tbl = get_html_translation_table(HTML_ENTITIES);
       $trans_tbl = array_flip($trans_tbl);
       return strtr($string, $trans_tbl);
    }
    
    $add_hours = intval($timezone_offset);
    $add_minutes = intval(60 * ($timezone_offset - $add_hours));
    
    if (!file_exists('/var/www/html/blog/wp-config.php')) die("There doesn't seem to be a wp-config.php file. You must install WordPress before you import any entries.");
    require('/var/www/html/blog/wp-config.php');
    
    // Default to step 0 when no step is given in the query string
    $step = isset($_GET['step']) ? intval($_GET['step']) : 0;
    ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>WordPress &rsaquo; Import from ATOM</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <style media="screen" type="text/css">
        body {
            font-family: Georgia, "Times New Roman", Times, serif;
            margin-left: 20%;
            margin-right: 20%;
        }
        #logo {
            margin: 0;
            padding: 0;
            background-image: url(http://wordpress.org/images/logo.png);
            background-repeat: no-repeat;
            height: 60px;
            border-bottom: 4px solid #333;
        }
        #logo a {
            display: block;
            text-decoration: none;
            text-indent: -100em;
            height: 60px;
        }
        p {
            line-height: 140%;
        }
        .error { color: red; }
        .info  { color: green; }
        </style>
    </head><body>
    <h1 id="logo"><a href="http://wordpress.org/">WordPress</a></h1>
    <?php
    switch($step) {
    
        case 0:
    ?>
    <p>Howdy! This importer allows you to extract posts from any ATOM 0.3 file into your blog. This is useful if you want to import your posts from a system that is not handled by a custom import tool. To get started you must edit the following line in this file (<code>aolatom.php</code>) </p>
    <p><code>define('ATOMFILE', '');</code></p>
    <p>You want to define where the ATOM file we'll be working with is, for example: </p>
    <p><code>define('ATOMFILE', 'atom.xml');</code></p>
    <p>You have to do this manually for security reasons. When you're done reload this page and we'll take you to the next step.</p>
    <?php if (defined('ATOMFILE')) : ?>
    <p class="info">All right! I sense that you've already done that. Let's start getting those posts
    from <code><?php echo ATOMFILE; ?></code>!</p>
    <h2 style="text-align: right;"><a href="aolatom.php?step=1">Begin ATOM Import &raquo;</a></h2>
    <?php endif; ?>
    <?php
        break;
    
        case 1:
    
    ?>
    <form action="aolatom.php" method="get">
    <p>So, who shall <strong>own the posts</strong> from the atom feed?
      <select name="owner">
      <?php
        //$users = $wpdb->get_results("SELECT ID FROM wp_users WHERE user_level > 0 ORDER BY ID");
        $users = $wpdb->get_results("SELECT * FROM wp_users ORDER BY ID");
    	//$wpdb->vardump($users);
        foreach ($users as $user) {
            //$user_data = get_userdata($user->ID);
            echo "<option value='{$user->ID}'>{$user->user_nicename}</option>\n";
        }
      ?>
        <option value="-1">Create new user: "atom user"</option>
      </select>
    </p>
    
    <p>And <strong>which category</strong> to stuff them in?
      <select name="category">
          <?php
          $categories = $wpdb->get_results("SELECT * FROM wp_categories");
          foreach ($categories as $category) {
            echo "<option value='{$category->cat_ID}'>{$category->cat_name}</option>\n";
          }
          ?>
        <option value="-1">Create new category: "atom imported"</option>
      </select>
    </p>
    
    <p>Right. <input type="submit" value="Let's go!"/>
    <input type="hidden" name="step" value="2"/>
    </p>
    <p class="info">This will <strong>import all posts</strong> from <code><?php echo ATOMFILE; ?></code>! Are you
    sure you want to do this?</p>
    </form>
    
    <?php
    
        break;
    
        case 2:
    
    // Bring in the data
    set_magic_quotes_runtime(0);
    $datalines = file(ATOMFILE); // Read the file into an array
    $importdata = implode('', $datalines); // squish it
    $importdata = str_replace(array("\r\n", "\r"), "\n", $importdata);
    
    //Does not exist in atom
    // the trailing "is" are pattern modifiers: i = ignore case, s = let . match newlines too
    preg_match('|<generator[^>]*>(.*?)</generator>|is', $importdata, $generator);
    $generator = addslashes( trim($generator[1]) );
    
    //Also strip out <![CDATA[...]]> that AOL adds
    preg_match('|<title[^>]*><\!\[CDATA\[(.*?)\]\]></title>|is', $importdata, $blogname);
    $blogname = addslashes( trim($blogname[1]) );
    echo "Importing blog called: {$blogname}<br /> ";
    
    preg_match_all('|<entry[^>]+>(.*?)</entry>|is', $importdata, $posts);
    $posts = $posts[1];
    
    if(!$posts) die(sprintf("Yikes! I didn't find any posts! Are you sure
    '%s' is a valid atom feed?", ATOMFILE));
    
    echo '<ol>';
    foreach ($posts as $post) :
    	$title = $date = $categories = $content = $post_id =  '';
    
    	preg_match('|<title[^>]*><\!\[CDATA\[(.*?)\]\]></title>|is', $post, $title);
    	$title = addslashes( trim($title[1]) );
    	$post_name = sanitize_title($title);
    	echo "<li>Importing post: <strong>{$post_name}</strong> \n";
    
    	//Get the date
    	preg_match('|<published>(.*?)</published>|is', $post, $date);
    	if ($date) :
    		$date = str_replace('T', ' ', $date);
    		$date = strtotime($date[1]);
    	else : // if we don't already have something from created
    		$date = strtotime("now");
    	endif;
    	$post_date = gmdate('Y-m-d H:i:s', $date);
    	echo "posted <strong>{$post_date}</strong> ";
    
    	$category = $_GET['category'];
    	if(!$category) $category = get_option('default_category');
    	if($category == -1) { // create new category for imported posts
    
    		$cat_name = wp_specialchars("Atom imported");
    		//check if this category already exists in database
    		$category = $wpdb->get_var("SELECT cat_ID FROM $wpdb->categories WHERE cat_name = '$cat_name'");
    
    		if(!$category) {
    
    			$id_result = $wpdb->get_row("SHOW TABLE STATUS LIKE '$wpdb->categories'");
    			$cat_ID = $id_result->Auto_increment;
    			$category_nicename = sanitize_title($cat_name, $cat_ID);
    			$category_description = sprintf("Imported from atom feed at %s (%s) on %s", ATOMFILE, $generator, gmdate('Y-m-d H:i:s'));
    
    			$wpdb->query("INSERT INTO $wpdb->categories (cat_ID, cat_name, category_nicename,
    			category_description) VALUES ('$cat_ID', '$cat_name', '$category_nicename', '$category_description')");
    			$category = $cat_ID;
    		}
    	}
    	echo "category <strong>{$wpdb->get_var("SELECT cat_name FROM $wpdb->categories WHERE cat_ID = '$category'")}</strong> ";
    
    	preg_match('|<name>(.*?)</name>|is', $post, $poster_name);
    	if ($poster_name) {
    		$poster_name = addslashes( trim($poster_name[1]) );
    	} else $poster_name = '';
    
    	$owner = $_GET['owner'];
    	if(!$owner) $owner = 1;  //default = admin!
    
    	if($owner != -1) {
    		$post_author = $owner;
    	} else { //create new owner for imported posts
    		$owner_name = "AOL user $poster_name";
    
    		$post_author = $wpdb->get_var("SELECT ID FROM $wpdb->users WHERE user_lastname = '$owner_name'");
    
    		if(!$post_author) {
    
    			$user_login     = wp_specialchars($poster_name);
    			$user_nickname  = "atom_user";
    			$pass           = time();
    			$user_email     = "";
    			$user_firstname = "";
    			$user_lastname  = $owner_name;
    
    			$now = gmdate('Y-m-d H:i:s');
    			$default_user_level = 1;
    
    			$result = $wpdb->query("INSERT INTO $wpdb->users
    				(user_login, user_pass, user_nickname, user_registered, user_level, user_idmode,
    				user_firstname, user_lastname)
    				VALUES
    				('$user_login', MD5('$pass'), '$user_nickname', '$now', '$default_user_level',
    				'nickname', '$user_firstname', '$user_lastname' )");
    
    			if ($result == false) {
    			die (__("<strong>ERROR</strong>: Couldn’t register {$owner_name}!"));
    			}
    			$post_author = $wpdb->get_var("SELECT ID FROM $wpdb->users ORDER BY ID DESC LIMIT 1");
    		}
    	}
    	echo "by <strong>{$poster_name}</strong> <br />";
    
    	//Now the CONTENT
    	preg_match('|<content([^>]+)?>(.*?)</content>|is', $post, $content);
    
    	//unhtmlentities (defined above) turns HTML entities back into characters; ->escape cleans for mysql insertion
    	//if( false !== strstr( $content[1], 'mode="escaped"' ) ) $content[2] = $wpdb->escape( unhtmlentities( $content[2] ) );
    
    	$content = str_replace( array('<![CDATA[', ']]>'), '',  trim($content[2]) );
    	//remove P, SPAN and FONT rubbish
    	//matches <P, [^>] = anything other than >, + means "Match one or more of the preceding expression", ? means "Match zero or one of the preceding expression"
    	$content = preg_replace(array('|<P([^>]+)?>|is', '|</P>|is'), '', $content);
    	$content = preg_replace(array('|<SPAN([^>]+)?>|is', '|</SPAN>|is'), '', $content);
    	$content = preg_replace(array('|<FONT([^>]+)?>|is', '|</FONT>|is'), '', $content);
    	$content = preg_replace('|<BR([^>]+)?>|is', '<br />', $content);
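    	// that just leaves typographic apostrophes, e.g. in can’t; turn them into plain
    	// ones here and let ->escape below add the backslash (adding it here escapes twice)
    	$content = str_replace( "’", "'",  $content );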
    	$content = str_replace( "–", "-",  $content );
    	$content = str_replace( "€", "EUR",  $content );
    	$content = $wpdb->escape( $content );
    	 //$content = iconv("UTF-8","UTF-8//IGNORE",$content);
    
    	echo "Content of post: {$content} <br />";
    
    	// Clean up html tags
    	$content = preg_replace('|<(/?[A-Z]+)|e', "'<' . strtolower('$1')", $content);
    	$content = str_replace('<br>', '<br />', $content);
    	$content = str_replace('<hr>', '<hr />', $content);
    
    	// This can mess up on posts with no titles, but checking content is much slower
    	// So we do it as a last resort
    	if ('' == $title) :
    	$dupe = $wpdb->get_var("SELECT ID FROM $wpdb->posts WHERE post_content = '$content' AND post_date = '$post_date'");
    	else :
    	$dupe = $wpdb->get_var("SELECT ID FROM $wpdb->posts WHERE post_title = '$title' AND post_date = '$post_date'");
    	endif;
    
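    	// $comment_status, $ping_status and $guid are used in the INSERT below but never set;
    	// fall back to the blog defaults and an empty guid
    	$comment_status = get_option('default_comment_status');
    	$ping_status = get_option('default_ping_status');
    	$guid = '';

    	// Now let's put it in the DB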
    	if ($dupe) :
    		echo '<span class="error">Post already imported</span>';
    	else :
    		$wpdb->query("INSERT INTO $wpdb->posts
    			(post_author, post_date, post_date_gmt, post_content, post_title, post_status, comment_status, ping_status, post_name, guid)
    			VALUES
    			('$post_author', '$post_date', DATE_ADD('$post_date', INTERVAL '$add_hours:$add_minutes' HOUR_MINUTE), '$content', '$title', 'publish', '$comment_status', '$ping_status', '$post_name', '$guid')");
    
    		$post_id = $wpdb->get_var("SELECT ID FROM $wpdb->posts WHERE post_title = '$title' AND post_date = '$post_date'");
    		if (!$post_id) : die("couldn't get post ID");
    		else : echo "Post allocated ID: {$post_id} <br />";
    		endif;
    
    		echo '<span class="info">Done!</span></li>';
    
    		$wpdb->query("INSERT INTO $wpdb->post2cat (post_id, category_id) VALUES ($post_id, $category)");
    		flush();
    	endif;
    
    endforeach;
    ?>
    </ol>
    
    <h3>All done. <a href="../">Have fun!</a></h3>
    <?php
        break;
    }
    ?>
    </body>
    </html>
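
    The script above replaces ’, – and € with one str_replace() call each, which gets unwieldy as the list of characters grows. A minimal sketch of doing the same thing in a single pass with strtr() and a lookup table; normalize_word_chars() is a hypothetical helper name, and the choice of characters and ASCII stand-ins is an assumption about what Word typically produces, not a list taken from the feed:

    ```php
    <?php
    // Hypothetical helper (not part of the script above): map common
    // typographic characters, given as UTF-8 byte sequences, to ASCII.
    function normalize_word_chars($text) {
        $map = array(
            "\xE2\x80\x98" => "'",   // U+2018 left single quote
            "\xE2\x80\x99" => "'",   // U+2019 right single quote / apostrophe
            "\xE2\x80\x9C" => '"',   // U+201C left double quote
            "\xE2\x80\x9D" => '"',   // U+201D right double quote
            "\xE2\x80\x93" => '-',   // U+2013 en dash
            "\xE2\x80\x94" => '--',  // U+2014 em dash
            "\xE2\x80\xA6" => '...', // U+2026 ellipsis
            "\xE2\x82\xAC" => 'EUR', // U+20AC euro sign
        );
        // strtr() with an array replaces every key in one pass and never
        // rescans text it has already replaced.
        return strtr($text, $map);
    }

    echo normalize_word_chars("can\xE2\x80\x99t \xE2\x80\x93 \xE2\x82\xAC5"), "\n";
    // prints: can't - EUR5
    ```

    The result still has to go through the escaping step before it is put into a query, since the plain apostrophes and double quotes it produces are exactly the characters that need a backslash. This only covers the case where losing the typographic characters is acceptable; keeping them would instead depend on the database tables and connection handling UTF-8.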
  • The topic ‘Importing special characters failing’ is closed to new replies.