Importing special characters failing
Hi,
I’m trying to update some import-atom code I found on the Internet so that it can import Atom files from an AOL Journals blog. It now does everything correctly except that the special characters are not coming over cleanly. As you can see in the script, I started adding slashes and replacements, but the number of special characters used is large (I think the original texts were written in Word and pasted into the blog).
The feed to be imported begins with
<?xml version="1.0" encoding="utf-8" ?>
This is the first MySQL and PHP I have ever done, so now I’m really at a loss. Hope someone can help.
HB

<?php
// based on import-rss.php
// all changes gpl licensed and 2005 (C) havard@dahle.no

// Include ezSQL core
include_once "../../../ezsql/shared/ez_sql_core.php";

// Include ezSQL database specific component (in this case mySQL)
include_once "../../../ezsql/mysql/ez_sql_mysql.php";

// Initialise database object and establish a connection
// at the same time - db_user / db_password / db_name / db_host
$wpdb = new ezSQL_mysql('wordpress','blog77','wordpress','localhost');

// Example:
define('ATOMFILE', '/home/simon/atom.xml');
// or if it's in the same directory as aolatom.php
// define('ATOMFILE', 'atom.xml');
// or somewhere online (NOTE: This requires 'allow_url_fopen=1' in php.ini!)
// see http://www.php.net/manual/en/ref.filesystem.php#ini.allow-url-fopen
// define('ATOMFILE', 'http://myfunkyblog.blogspot.com/atom.xml');

$timezone_offset = 2; // GMT offset of the posts you're importing

function unhtmlentities($string) { // From php.net for < 4.3 compat
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}

$add_hours = intval($timezone_offset);
$add_minutes = intval(60 * ($timezone_offset - $add_hours));

if (!file_exists('/var/www/html/blog/wp-config.php')) die("There doesn't seem to be a wp-config.php file. You must install WordPress before you import any entries.");
require('/var/www/html/blog/wp-config.php');

$step = $_GET['step'];
if (!$step) $step = 0;
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>WordPress › Import from ATOM</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style media="screen" type="text/css">
    body {
        font-family: Georgia, "Times New Roman", Times, serif;
        margin-left: 20%;
        margin-right: 20%;
    }
    #logo {
        margin: 0;
        padding: 0;
        background-image: url(http://wordpress.org/images/logo.png);
        background-repeat: no-repeat;
        height: 60px;
        border-bottom: 4px solid #333;
    }
    #logo a {
        display: block;
        text-decoration: none;
        text-indent: -100em;
        height: 60px;
    }
    p {
        line-height: 140%;
    }
    .error { color: red; }
    .info { color: green; }
</style>
</head><body>
<h1 id="logo"><a href="http://wordpress.org/">WordPress</a></h1>
<?php
switch($step) {

    case 0:
?>
<p>Howdy! This importer allows you to extract posts from any ATOM 0.3 file into your blog. This is useful if you want to import your posts from a system that is not handled by a custom import tool. To get started you must edit the following line in this file (<code>aolatom.php</code>) </p>
<p><code>define('ATOMFILE', '');</code></p>
<p>You want to define where the ATOM file we'll be working with is, for example: </p>
<p><code>define('ATOMFILE', 'atom.xml');</code></p>
<p>You have to do this manually for security reasons. When you're done reload this page and we'll take you to the next step.</p>
<?php if (defined('ATOMFILE')) : ?>
<p class="info">All right! I sense that you've already done that. Let's start getting those posts from <code><?php echo ATOMFILE; ?></code>!</p>
<h2 style="text-align: right;"><a href="aolatom.php?step=1">Begin ATOM Import »</a></h2>
<?php endif; ?>
<?php
    break;

    case 1:
?>
<form action="aolatom.php" method="get">
<p>So, who shall <strong>own the posts</strong> from the atom feed?
<select name="owner">
<?php
    //$users = $wpdb->get_results("SELECT ID FROM wp_users WHERE user_level > 0 ORDER BY ID");
    $users = $wpdb->get_results("SELECT * FROM wp_users ORDER BY ID");
    //$wpdb->vardump($users);
    foreach ($users as $user) {
        //$user_data = get_userdata($user->ID);
        echo "<option value='{$user->ID}'>{$user->user_nicename}</option>\n";
    }
?>
<option value="-1">Create new user: "atom user"</option>
</select>
</p>
<p>And <strong>which category</strong> to stuff them in?
<select name="category">
<?php
    $categories = $wpdb->get_results("SELECT * FROM wp_categories");
    foreach ($categories as $category) {
        echo "<option value='{$category->cat_ID}'>{$category->cat_name}</option>\n";
    }
?>
<option value="-1">Create new category: "atom imported"</option>
</select>
</p>
<p>Right. <input type="submit" value="Let's go!"/>
<input type="hidden" name="step" value="2"/>
</p>
<p class="info">This will <strong>import all posts</strong> from <code><?php echo ATOMFILE; ?></code>! Are you sure you want to do this?</p>
</form>
<?php
    break;

    case 2:

    // Bring in the data
    set_magic_quotes_runtime(0);
    $datalines = file(ATOMFILE); // Read the file into an array
    $importdata = implode('', $datalines); // squish it
    $importdata = str_replace(array("\r\n", "\r"), "\n", $importdata);

    //Does not exist in atom
    // is at end is pattern modifier i = ignore case, s = ignore end of line
    preg_match('|<generator[^>]*>(.*?)</generator>|is', $importdata, $generator);
    $generator = addslashes( trim($generator[1]) );

    //Also strip out <![CDATA[...]]> that AOL adds
    preg_match('|<title[^>]*><\!\[CDATA\[(.*?)\]\]></title>|is', $importdata, $blogname);
    $blogname = addslashes( trim($blogname[1]) );
    echo "Importing blog called: {$blogname}<br /> ";

    preg_match_all('|<entry[^>]+>(.*?)</entry>|is', $importdata, $posts);
    $posts = $posts[1];
    if(!$posts) die(sprintf("Yikes! I didn't find any posts! Are you sure '%s' is a valid atom feed?", ATOMFILE));

    echo '<ol>';
    foreach ($posts as $post) :
        $title = $date = $categories = $content = $post_id = '';

        preg_match('|<title.+><\!\[CDATA\[(.*?)\]\]></title>|is', $post, $title);
        $title = addslashes( trim($title[1]) );
        $post_name = sanitize_title($title);
        echo "<li>Importing post: <strong>{$post_name}</strong> \n";

        //Get the date
        preg_match('|<published>(.*?)</published>|is', $post, $date);
        if ($date) :
            $date = str_replace('T', ' ', $date);
            $date = strtotime($date[1]);
        else : // if we don't already have something from created
            $date = strtotime("now");
        endif;
        $post_date = gmdate('Y-m-d H:i:s', $date);
        echo "posted <strong>{$post_date}</strong> ";

        $category = $_GET['category'];
        if(!$category) $category = get_option('default_category');
        if($category == -1) {
            // create new category for imported posts
            $cat_name = wp_specialchars("Atom imported");
            //check if this category already exists in database
            $category = $wpdb->get_var("SELECT cat_ID FROM $wpdb->categories WHERE cat_name = '$cat_name'");
            if(!$category) {
                $id_result = $wpdb->get_row("SHOW TABLE STATUS LIKE '$wpdb->categories'");
                $cat_ID = $id_result->Auto_increment;
                $category_nicename = sanitize_title($cat_name, $cat_ID);
                $category_description = sprintf("Imported from atom feed at %s (%s) on %s", ATOMFILE, $generator, gmdate('Y-m-d H:i:s'));
                $wpdb->query("INSERT INTO $wpdb->categories (cat_ID, cat_name, category_nicename, category_description) VALUES ('$cat_ID', '$cat_name', '$category_nicename', '$category_description')");
                $category = $cat_ID;
            }
        }
        echo "category <strong>{$wpdb->get_var("SELECT cat_name FROM $wpdb->categories WHERE cat_ID = '$category'")}</strong> ";

        preg_match('|<name>(.*?)</name>|is', $post, $poster_name);
        if ($poster_name) {
            $poster_name = addslashes( trim($poster_name[1]) );
        } else
            $poster_name = '';

        $owner = $_GET['owner'];
        if(!$owner) $owner = 1; //default = admin!
        if($owner != -1) {
            $post_author = $owner;
        } else {
            //create new owner for imported posts
            $owner_name = "AOL user $poster_name";
            $post_author = $wpdb->get_var("SELECT ID FROM $wpdb->users WHERE user_lastname = '$owner_name'");
            if(!$post_author) {
                $user_login = wp_specialchars($poster_name);
                $user_nickname = "atom_user";
                $pass = time();
                $user_email = "";
                $user_firstname = "";
                $user_lastname = $owner_name;
                $now = gmdate('Y-m-d H:i:s');
                $default_user_level = 1;
                $result = $wpdb->query("INSERT INTO $wpdb->users (user_login, user_pass, user_nickname, user_registered, user_level, user_idmode, user_firstname, user_lastname) VALUES ('$user_login', MD5('$pass'), '$user_nickname', '$now', '$default_user_level', 'nickname', '$user_firstname', '$user_lastname' )");
                if ($result == false) {
                    die (__("<strong>ERROR</strong>: Couldn’t register {$owner_name}!"));
                }
                $post_author = $wpdb->get_var("SELECT ID FROM $wpdb->users ORDER BY ID DESC LIMIT 1");
            }
        }
        echo "by <strong>{$poster_name}</strong> <br />";

        //Now the CONTENT
        preg_match('|<content([^>]+)?>(.*?)</content>|is', $post, $content);
        //Not clear what unhtmlentities does, but ->escape cleans for mysql insertion
        //if( false !== strstr( $content[1], 'mode="escaped"' ) ) $content[2] = $wpdb->escape( unhtmlentities( $content[2] ) );
        $content = str_replace( array('<![CDATA[', ']]>'), '', trim($content[2]) );

        //remove P, SPAN and FONT rubbish
        //matches <P, [^>] = anything other than >, + means "Match one or more of the preceding expression", ? means "Match zero or one of the preceding expression"
        $content = preg_replace(array('|<P([^>]+)?>|is', '|</P>|is'), '', $content);
        $content = preg_replace(array('|<SPAN([^>]+)?>|is', '|</SPAN>|is'), '', $content);
        $content = preg_replace(array('|<FONT([^>]+)?>|is', '|</FONT>|is'), '', $content);
        $content = preg_replace('|<BR([^>]+)?>|is', '<br />', $content);

        // that just leaves unpaired ' - e.g. apostrophes and in can't
        $content = str_replace( "’", "\'", $content );
        $content = str_replace( "–", "-", $content );
        $content = str_replace( "€", "EUR", $content );
        $content = $wpdb->escape( $content );
        //$content = iconv("UTF-8","UTF-8//IGNORE",$content);
        echo "Content of post: {$content} <br />";

        // Clean up html tags
        $content = preg_replace('|<(/?[A-Z]+)|e', "'<' . strtolower('$1')", $content);
        $content = str_replace('<br>', '<br />', $content);
        $content = str_replace('<hr>', '<hr />', $content);

        // This can mess up on posts with no titles, but checking content is much slower
        // So we do it as a last resort
        if ('' == $title) :
            $dupe = $wpdb->get_var("SELECT ID FROM $wpdb->posts WHERE post_content = '$content' AND post_date = '$post_date'");
        else :
            $dupe = $wpdb->get_var("SELECT ID FROM $wpdb->posts WHERE post_title = '$title' AND post_date = '$post_date'");
        endif;

        // Now lets put it in the DB
        if ($dupe) :
            echo '<span class="error">Post already imported</span>';
        else :
            $wpdb->query("INSERT INTO $wpdb->posts (post_author, post_date, post_date_gmt, post_content, post_title, post_status, comment_status, ping_status, post_name, guid) VALUES ('$post_author', '$post_date', DATE_ADD('$post_date', INTERVAL '$add_hours:$add_minutes' HOUR_MINUTE), '$content', '$title', 'publish', '$comment_status', '$ping_status', '$post_name', '$guid')");
            $post_id = $wpdb->get_var("SELECT ID FROM $wpdb->posts WHERE post_title = '$title' AND post_date = '$post_date'");
            if (!$post_id) :
                die("couldn't get post ID");
            else :
                echo "Post allocated ID: {$post_id} <br />";
            endif;
            echo '<span class="info">Done!</span></li>';
            $wpdb->query("INSERT INTO $wpdb->post2cat (post_id, category_id) VALUES ($post_id, $category)");
            flush();
        endif;
    endforeach;
?>
</ol>
<h3>All done. <a href="../">Have fun!</a></h3>
<?php
    break;
}
?>
</body>
</html>
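
The character-by-character `str_replace()` calls near the end of the script could probably be folded into a single lookup table. The block below is only a sketch of that idea, not a tested fix for the importer: `normalize_word_punctuation()` is a made-up name, the list of characters is a guess at what Word typically pastes in, and it assumes the feed really is UTF-8, as its XML declaration says.

```php
<?php
// Hypothetical helper (not part of the importer above): replaces the
// typographic characters Word tends to insert with plain ASCII.
// The keys are written as raw UTF-8 byte sequences so the result does not
// depend on the encoding this PHP file happens to be saved in.
function normalize_word_punctuation($text)
{
    $map = array(
        "\xE2\x80\x98" => "'",   // U+2018 left single quotation mark
        "\xE2\x80\x99" => "'",   // U+2019 right single quotation mark (apostrophe)
        "\xE2\x80\x9C" => '"',   // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"',   // U+201D right double quotation mark
        "\xE2\x80\x93" => "-",   // U+2013 en dash
        "\xE2\x80\x94" => "-",   // U+2014 em dash
        "\xE2\x80\xA6" => "...", // U+2026 horizontal ellipsis
        "\xE2\x82\xAC" => "EUR", // U+20AC euro sign
    );

    // strtr() with an array replaces every key in one pass and never
    // re-scans text it has already replaced.
    return strtr($text, $map);
}

// Usage: "can’t – 10€" written as UTF-8 bytes.
echo normalize_word_punctuation("can\xE2\x80\x99t \xE2\x80\x93 10\xE2\x82\xAC"), "\n";
// prints: can't - 10EUR
```

The sketch maps to a bare apostrophe rather than `\'`, on the assumption that the later `$wpdb->escape()` call adds whatever escaping MySQL needs; adding the backslash by hand as well may leave a stray `\` in the stored post.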