WordPress.org

Forums

Regular Expression Not Working on Live site but is on localhost install (1 post)

  1. Strictly Software
    Member
    Posted 2 years ago

    I have a custom-built version of WP-O-MATIC (as it's not supported anymore) that, on certain feed imports, goes to the source URL, because the XML content field in the feed is just a pointer to the site/URL where the content actually lives rather than holding the content itself.

    So I get the URL from the feed, then screen scrape the actual source page to get the content I want. I don't hammer the sites in question, so I get 200 OK status codes: no 400, 403, 404 or redirect codes (301, 302 etc.), just 200 OK and the full source code I want back.

    The code works in the majority of cases: it loads the XML, finds the URL, scrapes the content using built-in WordPress functions (e.g. wp_get), finds the article body using regular expressions that extract the content I want for my post, and saves it.

    However, since the last WordPress core update (3.3.2), the regex doesn't seem to be working for one particular site I am scraping. I have even reverted to the old 3.3.1 codebase but the problem still exists, so I am not sure whether the update made database changes that were not undone when I changed the codebase files back to 3.3.1.

    I have a local WAMPServer setup at home where my test page works fine: it finds the content and extracts it as it's supposed to, and as it used to on earlier WP versions. This code has been working for at least 2 years.

    The regular expression that is broken is below.

    preg_match("@^[\S\s]+<h1\s+class=['\"]singlePageTitle['\"]>[\S\s]+?</h2>([\S\s]+?)\S+<div class=['\"]clear['\"]><\/div>@ui", $content, $match);

    if ($match) {
        // We got our content in between entry_img and div break tags
        $filtered = trim($match[1]);
    }

    I have also tried other regular expressions, all of which work on my local WAMPServer setup but NOT on live, e.g.

    preg_match("@^[\S\s]+<h1 class=['\"]singlePageTitle['\"]>[\S\s]+?</h2>([\S\s]+?)\S+</div>[\S\s]+$@i",$content,$match);

    I even tried a replace, i.e. replacing all the content around the text I want with nothing so that only the content I want is left, e.g.

    $filtered = preg_replace("@(^[\S\s]+<h1 class=['\"]singlePageTitle['\"]>[\S\s]+?</h2>)([\S\s]+?)(\S+</div>[\S\s]+$)@", "$2", $content);

    All these methods work on my test setup (WAMPServer on a 64-bit Windows 7 PC with 8GB RAM) but not on my Rackspace cloud-hosted LAMP setup with the latest versions of MySQL, PHP and all packages.
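
    Since the same pattern behaves differently on the two machines, one thing that could be compared is the PCRE configuration on each. The following is only a sketch: the function name pcre_environment_report() is hypothetical, and nothing in this thread confirms that these settings are the cause. It uses only standard PHP constants and ini_get().

```php
<?php
// Hypothetical helper (not part of WordPress or WP-O-MATIC): collects the
// PHP and PCRE settings that most often differ between two hosts.
function pcre_environment_report(): array
{
    return [
        'php_version'          => PHP_VERSION,
        // Version string of the PCRE library PHP was built against
        'pcre_version'         => PCRE_VERSION,
        // Limits after which preg_* functions give up and return false/null
        'pcre.backtrack_limit' => ini_get('pcre.backtrack_limit'),
        'pcre.recursion_limit' => ini_get('pcre.recursion_limit'),
    ];
}

// Print the report so the output of both servers can be compared side by side.
foreach (pcre_environment_report() as $name => $value) {
    echo $name . ': ' . var_export($value, true) . PHP_EOL;
}
```

    Running this on both the WAMPServer box and the live LAMP server, and comparing the two outputs, would show whether the regex engine is configured differently.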

    It used to work, and my code is exactly the same as it has always been, but now it's not working. With WordPress debug on, it just gets to the regex and reports that nothing was matched, e.g. count($match) == 0.
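
    An empty match array can mean two different things: the pattern did not match (preg_match() returns 0), or the regex engine hit an error such as a backtrack limit or malformed UTF-8 (preg_match() returns false and preg_last_error() holds the reason). A minimal sketch for telling the two apart follows. The helper name debug_preg_match() and its return keys are hypothetical, and whether either error applies here is an assumption, not something established in this thread.

```php
<?php
// Hypothetical helper: wraps preg_match() and reports whether the pattern
// simply did not match or whether the PCRE engine returned an error.
function debug_preg_match(string $pattern, string $subject): array
{
    // Readable names for the error codes returned by preg_last_error()
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
    ];

    $match  = [];
    $result = preg_match($pattern, $subject, $match);
    $code   = preg_last_error();

    if ($result === false) {
        // false is an engine error, which is not the same as "no match"
        $status = 'error';
    } elseif ($result === 0) {
        $status = 'no-match';
    } else {
        $status = 'match';
    }

    return [
        'status' => $status,
        'error'  => $names[$code] ?? ('unknown error code ' . $code),
        'match'  => $match,
    ];
}
```

    Calling debug_preg_match() with the failing pattern and the scraped $content on the live server, and logging the 'status' and 'error' values, would show whether the page really fails to match or whether PCRE gave up before finishing.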

    I don't know if it's a UTF-8 problem, as the site I scrape from has <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> and the response headers are UTF-8.
    I have tried removing all UTF-8 (non-ASCII) characters before running any regex but no luck.

    I don't know what else to do. I have had issues like this before, where a regex doesn't work on live but does on my "demo" / "dev" box for no apparent reason.

    There are no errors in the log files or on the page, and I have tried using the /u modifier for UTF-8 as well, EVEN after removing all UTF-8 chars.
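
    With the /u modifier, PCRE refuses to run at all when the subject contains bytes that are not well-formed UTF-8, so it can help to check the scraped page before matching. The sketch below shows one way to do that and one way a "strip everything non-ASCII" step could look. Both function names are hypothetical, and the stripping approach is an assumption about what such a step might be, not the code actually used here.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
// An empty pattern with /u makes PCRE validate the whole subject;
// it returns 1 for valid input and false for malformed input.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: drops every byte outside the 7-bit ASCII range.
// The pattern deliberately has no /u modifier, so it works byte by byte
// and cannot itself fail on malformed UTF-8.
function strip_non_ascii(string $s): string
{
    return preg_replace('/[^\x00-\x7F]/', '', $s);
}

// Sample inputs: "café" encoded as UTF-8 and as Latin-1 (ISO-8859-1)
$utf8   = "caf\xC3\xA9";
$latin1 = "caf\xE9";

var_dump(is_valid_utf8($utf8));                     // well-formed UTF-8
var_dump(is_valid_utf8($latin1));                   // lone 0xE9 byte is malformed
var_dump(is_valid_utf8(strip_non_ascii($latin1)));  // pure ASCII after stripping
```

    If is_valid_utf8() returns false for the content fetched on the live server but true locally, the two environments are not receiving or decoding the page in the same way.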

    At the moment it happens whether I "force" a fetch from admin OR run it from Webmin via a cron wget job.

    Everything else seems to work ok.

    What can I try OR how can I debug this?

    Any help on this matter would be much appreciated thanks!

Topic Closed

About this Topic

  • Started 2 years ago by Strictly Software
  • This topic is not resolved
  • WordPress version: 3.3.2