• Resolved Matt

    (@syntax53)


    I just spent a bunch of time sort of “overhauling” wp-media-cleaner’s scanning and detection.

    I wrote a plugin recently for indexing PDFs and for the site I wrote it for I only wanted to index PDFs that were deemed as being attached. So I wrote a bunch of detection methods for finding files within content, shortcode, widgets, etc. I found wp-media-cleaner often missed some of this content so I took the time to apply what I had done with my plugin to yours. I also found and fixed a couple other issues along the way.

    Besides finding content that it was reported as unused, the other big issue I wanted to address was most of your matches were using partial file names. e.g. If I file was “uploads/path/image.png”, in a few places you were only searching “image.png.” That would obviously match that filename if it were in any path on the server, or worse, even a file linked so an external host. I came up with the following regex for my plugin and implemented that here:

    $regex = addcslashes('=[\'"](?:(?:http(?:s)?\\:)?//'.preg_quote($parsedURL['host']).')?(?:'.preg_quote($parsedURL['path']).'/)'.$regex_match_file.'(?:\\?[^\'"]*)*[\'"]', '/');

    Using that regex, the code will match all of the following:

    href=”https://www.host.com/uploads/path/image.png”
    href=”http://www.host.com/uploads/path/image.png?somevar=1″
    href=’http://www.host.com/uploads/path/image.png’
    href=”//www.host.com/uploads/path/image.png”
    href=”/uploads/path/image.png”
    src=”/uploads/path/image.png”

    But it would not match…

    href=”http://www.someotherhost.com/uploads/path/image.png”
    href=”http://www.host.com/uploads/other_path/path/image.png”

    The regex is adapted for matching against metal fields which wont be wrapped in quotes:

    $regex = addcslashes('(?:(?:(?:http(?:s)?\\:)?//'.preg_quote($parsedURL['host']).')?(?:'.preg_quote($parsedURL['path']).'/)|^)'.$regex_match_file, '/');

    Which will match…

    https://www.host.com/uploads/path/image.png
    http://www.host.com/uploads/path/image.png
    /uploads/path/image.png
    path/image.png

    etc.

    Here is a file compare. http://www.diffnow.com/?report=58be2

    And here is the file: https://www.dropbox.com/s/gkfdud7jye2zhiz/wp-media-cleaner.php?dl=0

    function wpmc_wp_ajax_wpmc_scan_do
    102: I can’t stand line like this without braces 🙂 …but I really the only reason I put them in was so I can add some debugging lines while I worked.

    function wpmc_get_galleries_images
    172: 12 hours seemed rather long for the transient

    function wpmc_wp_ajax_wpmc_scan
    183: This is my new transient for the post_content with shortcode extracted

    193-203: This set of code eliminates some double scanning the process was doing. I didn’t see a reason so scan files that were also going to be scanned from the media side as well. So I first store the IDs and Paths in an array (one of which gets passed back for the json) and the then they are used to eliminated from the $files array media that will be scanned anyway.

    205-218: I also felt like there is no need to scan all of the alternate image sizes as long as the source file existed in the database.

    function wpmc_delete
    318: Just an extra check to make sure we aren’t deleting a local the file the actually has a media db entry. Earlier when I was testing this happened. Could have been caused by something else but I figured worth leaving in.

    function wpmc_check_is_ignore
    440: From my understanding this is the way to properly escape-like in wordpress. Also, you didn’t have deleted = 0 in the query before. So it would ignore a new file that had the same path as an old one that was trashed.

    function wpmc_check_in_gallery
    475-482: added wpmc_clean_uploaded_filename here so we can pass the full path and get better matching

    function wpmc_check_db_has_featured
    487: wpmc_check_db_has_featured — There is no need for wpmc_check_db_has_featured. I realized this after modifying. I lef the code there for your review, but wpmc_check_db_has_meta probably isn’t any less expensive and covers the same thing.

    function wpmc_check_db_has_meta
    503: Along with passing the full path into this function I am regexing the match, making sure it’s not matching against it’s own meta_data by also passing in the attachment_id (when available) and also added meta_value = attachment_id to catch some other forms of using featured images and some other plugins.

    function wpmc_check_db_has_content
    519: So this is where the motherload of changes is… So the first thing I do is query post content via the more stricter regex match. If the file is found then great. Move on. If not, then we pull all posts that contain all of the applicable shortcode, loop through them processing the do_shortcode, and then store the results in a transient. I do the same thing with all active widgets and store it along with the post content. Once the transient is stored, we then loop through it checking for matches with the regex.

    function wpmc_find_attachment_id_by_file
    598: Added function. Name describes it all. Used to get an attachment ID from a file path.

    function wpmc_check_file
    611: This is a fix. You were doing strip slashes for $filepath but the $path was being used in other places. Most importantly storing it in your issues table. This was breaking detection for files with single quotes in their name.

    638: First we check the ignore table. If it’s there just quit.

    640: Get the attachment id (used later) and then determine by the extension and with another bit a regex if this is an alternate image size. If the source file does not have a media id or the source file does not exist then mark it as “no media.” Otherwise I feel there is no reason to bother checking each of these individually. When the source file itself is scanned we will check the alternate image sizes then. NOTE: This is kind of irrelevant now because I coded this before I made the modifications to wpmc_wp_ajax_wpmc_scan where I eliminate the image sizes from the scanning arrays. So in theory, this should never happen anyway now.

    655: This is another fix. I added wpmc_check_db_has_meta here and moved wpmc_check_db_has_content (in order of most expensive to least). You had wpmc_check_db_has_meta after this and I couldn’t quite figure out why as by doing so a file would never be marked as “NO_CONTENT” since it wouldn’t get by this and to the insert statement without being marked as “NO_MEDIA”.

    666: To check alternate image sizes of files I use the glob function. I think maybe this should may be employed into check_media in addition to or instead of using wpmc_get_image_sizes. I didn’t find a reason to for now as all of my images checked out. But something to consider.

    function wpmc_check_media
    740 (& 776): Check to see if the file exists to check for orphaned media. If it doesn’t exist I add a new issue “ORPHAN_MEDIA”.

    747 & 766: Add wpmc_check_db_has_meta to this and also move wpmc_check_db_has_content last for speed.

    function echo_issue
    821: Added new issue

    function wpmc_screen
    813: Another fix: weren’t htmlspecialchar escaping the file itself. Only the path before it. Broke files with single quotes in them.

    https://wordpress.org/plugins/wp-media-cleaner/

Viewing 6 replies - 1 through 6 (of 6 total)
  • Plugin Author Jordy Meow

    (@tigroumeow)

    Thanks SO MUCH for that. I will look into it tomorrow or after tomorrow but I am going to apply those changes to my local, read carefully what you wrote and test it. Let me come back to you 🙂

    Thread Starter Matt

    (@syntax53)

    After reviewing my changes again with a fresh head I had some minor improvements:

    Updated file: https://www.dropbox.com/s/gkfdud7jye2zhiz/wp-media-cleaner.php?dl=0

    File Compare: http://www.diffnow.com/?report=sgopn

    Changes:

    function wpmc_wp_ajax_wpmc_scan
    188: Added braces. Again I don’t like not using them on multi-lines.

    205: I moved the part about removing alternate image sizes into the “if ( wpmc_getoption( ‘scan_media’, ‘wpmc_basics’, true ) ) {” section since it is only relevant when scanning media. I suppose it could be wrapped with “if ( wpmc_getoption( ‘scan_files’, ‘wpmc_basics’, true ) ) {” as well, or have the whole thing moved down and then wrapped with both. But $files will be empty if not scanning files anyway so it’s negligible.

    function wpmc_delete( $id ) {
    313: I realized that since we are inserted into the database with “path/filename.ext (+ add files)” that the ” (+ X files)” needs to be stripped off in order to check for a media DB and to delete the files correctly. Also since we are stripslashing before inserting into the database now the stripslashes would be redundant here and could potentially mess up the filename, though I think there should never be backslashes in the filename at this stage anyway? So maybe leave it in, shrug.

    NOTE: Because of these changes with the stripslashing of the filename, people should probably be informed to do a “reset” for scanning again.

    function wpmc_check_db_has_meta
    512: What I had initially done with “if (empty($attachment_id)) $attachment_id = -1;” was sloppy. I changed it around to be better.

    Thread Starter Matt

    (@syntax53)

    Uhg, sorry to keep posting updates, but I just noticed my regex was wrong in the last update. Line 317, instead of:

    $regex = addcslashes(“^(“.preg_quote($issue->path).”)(.*)”, ‘/’);

    should be this instead:

    $regex = “^(.*)(\\s\\(\\+.*)$”;

    Updated file: https://www.dropbox.com/s/gkfdud7jye2zhiz/wp-media-cleaner.php?dl=0

    Plugin Author Jordy Meow

    (@tigroumeow)

    Read, tested, it really looks good. Only thing that I modified it’s a call to empty which included a function (we cannot do that) and also I had cross site scripting vulnerability (my fault). Let’s see how it goes for everyone… 🙂

    Even though that’s nothing much, I added you in the header of the plugin. Thanks so much, Matt, it’s gold to have people like you around.

    Thread Starter Matt

    (@syntax53)

    Thanks for the props, much appreciated. What happened to the url for the plugin? Saying it doesn’t exists now.

    Plugin Author Jordy Meow

    (@tigroumeow)

    From now, the most recent source code will be on GitHub:
    https://github.com/tigroumeow/wp-media-cleaner

    Will make our life easier 🙂 Don’t hesitate to submit pull requests!

    Actually there might be a new bug after your changes, but not sure where yet. A file of mine has been detected as un-used but it actually is. I try to delete it, for testing purposes, and it wasn’t deleted (“Issue listed as filesystem but media db exists”).

Viewing 6 replies - 1 through 6 (of 6 total)
  • The topic ‘Overhaul of file scanning and detection’ is closed to new replies.