ISO Matt's patches to extend-clean.php

  • Resolved piantadosi

    (@piantadosi)


    A year ago, in this closed thread, @mattdawsonuk mentioned running a modified version of the Docs-to-WP plugin’s extend-clean.php which stripped out Google’s tracking from hyperlinks.

    Matt, if you’re out there, I’m gratefully using your other DTWP patch, which retains text formatting, bold and italic, but would love to know how to remove the Google tracking from Google Doc hyperlinks before they reach WordPress.

    Thanks!

    Roger

    https://wordpress.org/plugins/docs-to-wordpress/

Viewing 12 replies - 1 through 12 (of 12 total)
  • mattdawsonuk

    (@mattdawsonuk)

    Hi Roger

    Just got your message about the tracking links. You are in luck! I was able to add in some regex replacements to remove the Google tracking. I’ll send over a code snippet when I’m back at my desk on Monday. I should also send in a pull request to get the bold/italic fix merged in with the core!

    Matt

    traceyiaizzi

    (@traceyiaizzi)

    Matt,
    My company is using your Docs to WordPress add-on (Chrome and Firefox, not the WordPress plugin) extensively. We are a local news organization.

    Suddenly, within the past several days, it has started failing to load, so a growing number of the 70 people who rely on it are unable to use it. I am at a loss as to why.

    Are you hearing of this problem from others?

    Mac OSX 10.10.5, Chrome and Firefox.

    Thanks, Tracey

    piantadosi

    (@piantadosi)

    Many thanks to you, Matt (and, in other support threads, @tararebeka) for your help getting the plug-in working again and definitely making this a way luckier day than I thought it would be when it started.

    @traceyiaizzi, that Google Docs add-on died without warning on July 1, and the developer’s website is also gone. We are also a local news organization, and we had been using it, or trying to, for the last year. I could be wrong, but I don’t believe Matt has anything to do with the browser-based add-on. However, I wanted to confirm that you are not alone, and also that this WP plug-in works, and way more reliably than the add-on ever did.

    Roger

    traceyiaizzi

    (@traceyiaizzi)

    Thank you Roger!

    Will check out this plugin. Good to know we are not alone. Nothing like a complete breakdown of your editorial workflow to welcome a person back from vacation 😉

    Tracey

    mattdawsonuk

    (@mattdawsonuk)

    @piantadosi I’ve had a busy couple of days and forgot to send you my code! I’ll send it to you in the morning.
    Matt

    mattdawsonuk

    (@mattdawsonuk)

    To remove the Google tracking from URLs, I have this added to extend-clean.php (inserted at line 71):

    //remove google tracking links - mattd
    $post_content = str_replace('https://www.google.com/url?q=', '', $post_content);
    $post_content = str_replace('http://www.google.com/url?q=', '', $post_content);
    $post_content = preg_replace('/&sa=D&sntz=1&usg=(.*?)\">/', '">', $post_content);
    $post_content = preg_replace('/&sa=D&usg=(.*?)\">/', '">', $post_content);
    $post_content = preg_replace('/&sa=D&ust=(.*?)&usg=(.*?)\">/', '">', $post_content);
    $post_content = str_replace('%3A', ':', $post_content);
    $post_content = str_replace('%2F', '/', $post_content);
    $post_content = str_replace('%3F', '?', $post_content);
    $post_content = str_replace('%3D', '=', $post_content);

    I’m sure this can be reduced to a single replacement, as I have done this in a Python app that utilised the Drive API. However, I don’t have time to change and test it. There are two lines that look similar because Google changed the construct of their tracking links about 12 months ago.
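
A minimal sketch of what a single-pass replacement might look like in PHP. This is not Matt's Python version, which is not shown in the thread. The function name `dtwp_unwrap_google_links` is hypothetical, and the sketch assumes the redirect always sits inside a double-quoted `href` attribute and that the trailing parameters may be separated by either `&` or `&amp;`.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): unwrap Google redirect
 * links found in href attributes in one pass.
 */
function dtwp_unwrap_google_links($post_content) {
    // Capture the q= target up to the first "&" (which also covers "&amp;")
    // and discard whatever tracking parameters follow it.
    $pattern = '~href="https?://www\.google\.com/url\?q=([^"&]*)(?:&[^"]*)?"~i';

    return preg_replace_callback($pattern, function ($m) {
        // Decode %3A, %2F, %3F, %3D and any other escapes in the target only,
        // so the rest of the post content is left untouched.
        $target = rawurldecode($m[1]);
        // Re-escape for safe use inside an HTML attribute.
        return 'href="' . htmlspecialchars($target, ENT_QUOTES, 'UTF-8') . '"';
    }, $post_content);
}
```

It would be called as `$post_content = dtwp_unwrap_google_links($post_content);` in place of the individual replacements. Because decoding is limited to the captured target, percent-escapes elsewhere in the post are not altered.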

    Hope this helps.

    Matt

    mattdawsonuk

    (@mattdawsonuk)

    And in case this helps, here are some of the other changes I have made (inserted after the Google tracking removal code):

    1. I needed support for tables, so I have added to the allowed tags on line 74. Mine looks like this (note the addition of the table, tr, td and th tags):

    $post_content = strip_tags($post_content, '<strong><b><i><em><a><u><br><p><ol><ul><li><h1><h2><h3><h4><h5><h6><table><tr><td><th>' );

    2. I was getting Google classes pulling through in some elements, so I also run this:

    //Remove classes in elements - mattd
    $post_content = preg_replace('/<p(.*?)>/', '<p>', $post_content);
    $post_content = preg_replace('/<h1(.*?)>/', '<h1>', $post_content);
    $post_content = preg_replace('/<h2(.*?)>/', '<h2>', $post_content);
    $post_content = preg_replace('/<h3(.*?)>/', '<h3>', $post_content);
    $post_content = preg_replace('/<h4(.*?)>/', '<h4>', $post_content);
    $post_content = preg_replace('/<h5(.*?)>/', '<h5>', $post_content);
    $post_content = preg_replace('/<h6(.*?)>/', '<h6>', $post_content);
    $post_content = preg_replace('/<li(.*?)>/', '<li>', $post_content);
    $post_content = preg_replace('/<ul(.*?)>/', '<ul>', $post_content);
    $post_content = preg_replace('/<a class=\"(.*?)\"/', '<a', $post_content);

    3. To remove any empty tags, I run this:

    //Remove empty elements - mattd
    $post_content = preg_replace('/<a><\/a>/', '', $post_content);
    $post_content = preg_replace('/<h1><\/h1>/', '', $post_content);
    $post_content = preg_replace('/<h2><\/h2>/', '', $post_content);
    $post_content = preg_replace('/<h3><\/h3>/', '', $post_content);
    $post_content = preg_replace('/<h4><\/h4>/', '', $post_content);
    $post_content = preg_replace('/<h5><\/h5>/', '', $post_content);
    $post_content = preg_replace('/<div><\/div>/', '', $post_content);

    As you can see, it is easy to set up your own string replacements to change the content during the Google-to-WP conversion.
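
The two repetitive blocks above could be driven from a list of tag names. The following is only a sketch: both function names are hypothetical, and two behaviours differ from the lines as posted. The `\b` word boundary keeps `p` from also matching `<pre>` and `li` from matching `<link>`, and the empty-element check also treats whitespace-only elements as empty.

```php
<?php
// Hypothetical helper: drop every attribute from the listed tags.
function dtwp_strip_attributes($post_content, array $tags) {
    foreach ($tags as $tag) {
        // \b ensures the tag name ends here, so "p" does not match "<pre ...>".
        $pattern = '~<' . preg_quote($tag, '~') . '\b[^>]*>~i';
        $post_content = preg_replace($pattern, '<' . $tag . '>', $post_content);
    }
    return $post_content;
}

// Hypothetical helper: remove listed elements that contain nothing
// (or only whitespace) between the opening and closing tag.
function dtwp_remove_empty_elements($post_content, array $tags) {
    foreach ($tags as $tag) {
        $quoted = preg_quote($tag, '~');
        $pattern = '~<' . $quoted . '>\s*</' . $quoted . '>~i';
        $post_content = preg_replace($pattern, '', $post_content);
    }
    return $post_content;
}
```

A call might look like `dtwp_strip_attributes($post_content, array('p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'ul'))`. Links are deliberately left out of that list, since stripping every attribute from `<a>` would also remove the `href`; the class-only replacement for `<a>` shown above would still be needed.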

    piantadosi

    (@piantadosi)

    Thanks very much for all of this, Matt.

    Um, afraid to ask but . . . has anyone else noticed that bold and italic attributes are not being caught? Maybe it’s just me. It’s probably just me. Or maybe Google changed something again. It was working fine two days ago.

    I have double- and triple-checked that Matt’s additional code is indeed there in docs-to-wp.php (we just switched hosts, so I’ve been doing a lot of copying and verifying). Should I check if the PHP version on the new host is new enough? No errors or warnings in the log.

    Also, Matt, if you’re there, I can’t seem to get the Google tracking code removal patch you created to work. It removes the leading https://www.google.com/url?q= just fine, but the “&sa=” and the long tracking code that follows it are still there. The base URL is correct, but the browser arrives at the destination and tries to execute a variable or find a page that doesn’t exist.

    If anyone else has noticed a change in the way the Drive API is coding text or link attributes, let me know. I’m going to check my horoscope now.

    Roger

    mattdawsonuk

    (@mattdawsonuk)

    Hi Roger.

    I spotted earlier this week that Google have changed their bold styling to use ‘font-weight:700’ instead of ‘font-weight:bold’.

    So you just need to change ‘font-weight:bold’ to ‘font-weight:700’ in my fix code and it should sort it.
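
The bold/italic patch itself is not shown in this thread, so the following is only an illustration of the idea, not that patch. It assumes bold text arrives as a flat (non-nested) `<span>` with an inline `style` attribute, and it accepts both the old and the new value so that a future switch back would not break it. The function name `dtwp_spans_to_strong` is hypothetical.

```php
<?php
// Hypothetical helper: turn bold-styled spans into <strong> elements.
function dtwp_spans_to_strong($html) {
    // Accept "font-weight:bold" as well as the newer "font-weight:700".
    // Assumes spans are not nested inside one another.
    $pattern = '~<span[^>]*style="[^"]*font-weight:\s*(?:bold|700)[^"]*"[^>]*>(.*?)</span>~is';
    return preg_replace($pattern, '<strong>$1</strong>', $html);
}
```

Spans with any other weight, such as `font-weight:400`, are left as they are.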

    I’ll have a test of the tracking link removal code later in the week and get back to you on that.

    Hope that helps.

    Matt

    piantadosi

    (@piantadosi)

    Hey Matt,

    I made the fix and, as you say over there, Bob’s your uncle.

    Whoever your uncle is, you can tell him I have only good things to say about his nephew Matt.

    Thanks once again.

    Roger

    PS: Unless it’s really complicated to explain to a non-coder like me, let me know if there’s a way that I can see what’s coming over from Google. I tried downloading a few Docs files as RTF and HTML, but the text formatting was done in a totally different way in the downloaded documents.
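
One possible way to inspect the incoming markup, offered as a sketch only: write the HTML to a file before any of the replacements run. The helper name `dtwp_dump_raw_html` and the file path are hypothetical, and the sketch assumes `$post_content` already holds the HTML from Google at the point where it is called.

```php
<?php
// Hypothetical debugging helper: append the raw HTML to a log file.
function dtwp_dump_raw_html($post_content, $path) {
    // A timestamped separator makes successive dumps easy to tell apart.
    $entry = "----- " . date('c') . " -----\n" . $post_content . "\n";
    // FILE_APPEND keeps earlier dumps; LOCK_EX avoids interleaved writes.
    return file_put_contents($path, $entry, FILE_APPEND | LOCK_EX) !== false;
}
```

For example, `dtwp_dump_raw_html($post_content, '/tmp/dtwp-raw.html');` placed above the first replacement in extend-clean.php would record the markup as it arrives. The call should be removed once the check is done, since the file grows with every import.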

    Marking as resolved.

  • The topic ‘ISO Matt's patches to extend-clean.php’ is closed to new replies.