Support » Plugin: NinjaScanner - Virus & Malware scan » Several improvement suggestions / ideas

  • Here are some suggestions for improvements regarding memory consumption, performance and so on.

    On some servers I have the issue that I hit some memory limits. This happens also when the zip file of the WordPress core is downloaded (which is probably loaded into memory at once).

    And on servers with many and sometimes big files it may happen, even if the limits for the scan are at the default (file size limit) and even when I add some file extensions to the exclusion list (hangs and fails at “Building file list”).

    20-Nov-20 11:41:41 INFO  Processing step 1/14
    20-Nov-20 11:41:41 INFO  Checking NinjaScanner files integrity
    20-Nov-20 11:41:41 INFO  Files integrity is O.K
    20-Nov-20 11:41:44 INFO  Processing step 2/14
    20-Nov-20 11:41:44 INFO  Building file's list
    20-Nov-20 11:41:45 INFO  Total files found: 99.376
    20-Nov-20 11:41:45 INFO  Building database posts and pages checksum
    20-Nov-20 11:41:45 INFO  Found 0 posts and 51 pages in the database
    20-Nov-20 11:41:50 INFO  Processing step 3/14
    20-Nov-20 11:41:50 INFO  Skipping WordPress core files integrity check
    20-Nov-20 11:41:50 ERROR Error: E_ERROR (Out of memory (allocated 162529280) (tried to allocate 14684160 bytes) - 523)
    20-Nov-20 11:42:01 INFO  Cancelling scanning process
    20-Nov-20 11:42:05 INFO  Cancelling scanning process
    20-Nov-20 11:42:07 INFO  Cancelling scanning process
    20-Nov-20 11:42:07 INFO  Cancelling scanning process

    I have seen that you use file_get_contents() to read the files but it might be better and more performant to reach the files only line by line, using fopen, fgets, fclose and a generator (yield). Also you could use the RecursiveIteratorIterator with RecursiveDirectoryIterator (also with yield). This would lead to a maximum of about 4 – 6 MB per full scan (in most cases).

    https://plugins.trac.wordpress.org/browser/ninjascanner/trunk/lib/scan.php#L2167

    And for the zip file it may be better to stream and append the content from the WordPress SVN, same for files from there (for the file comparison).

    https://plugins.trac.wordpress.org/browser/ninjascanner/trunk/lib/file_compare.php#L190

    Just some ideas. With these solutions (without the zip streaming part) I have built some small tools to scan a whole webspace using FTP or PHP on SSH in the shortest time possible and most performant way possible.

    The end of the zip download could be checked by increasing some byte counter during the download and comparing it with the content-length header value (that cna be retrieved with a HEAD request) and checksum if possible and available.

    curl -I https://wordpress.org/wordpress-5.5.3.zip
    HTTP/2 200 
    server: nginx
    date: Fri, 20 Nov 2020 11:52:58 GMT
    content-type: application/zip
    content-length: 14009358
    cache-control: private
    content-disposition: attachment; filename=wordpress-5.5.3.zip
    last-modified: Fri, 30 Oct 2020 20:41:59 GMT
    x-frame-options: SAMEORIGIN
    content-md5: b7a9eb3560e5e17df79b7131a9fafdd7
    x-nc: HIT ord 2
    accept-ranges: bytes

    See also:
    https://www.sitepoint.com/performant-reading-big-files-php/
    https://www.php.net/manual/de/ziparchive.getstream.php
    https://www.ramielcreations.com/using-streams-in-wordpress-http-requests
    https://developer.wordpress.org/reference/functions/wp_remote_get/

    PS: sorry, forgot to mark this as “not a support question”

Viewing 13 replies - 1 through 13 (of 13 total)
  • Thread Starter danielrufde

    (@danielrufde)

    Oh and I see that there is some content-md5 header at least, which might be helpful.

    Plugin Author nintechnet

    (@nintechnet)

    Did you check the amount of RAM used by all 3 steps? You need to check the “Debug” option below the log. That’s odd you reached 160MB at step 3.

    We’ll need to improve the scanner. For instance, we’ll split the plugins and themes scan into small processes (1 for each theme/plugin) rather that one for all themes and one for all plugins. We’ve seen users having over 50 plugins on very slow servers and a timeout could occur.
    WordPress’ wp_remote_get function downloads the whole file at once, but usually that’s not really a problem as its installation file is only 14Mb.
    I’ll check the alternative to file_get_contents(), it’s true it can consume way too much RAM, just like ZIP files extraction.

    Thread Starter danielrufde

    (@danielrufde)

    In my case the customer also has many plugins and files (as you can see almost 100k), so this is probably a similar issue. In my own scripts I have made good experience with the proposed solutions which worked on all servers and scanned big files in a very short amount of time (just a simple script to search for specific strings).

    As far as I can see in the documentation wp_remote_get accepts arguments for streaming and blocking/non-blocking mode. But I am not sure if the method from WordPress may be the best solution and if it even works as expected, personally I would write my own implementation using curl with streaming and waiting until all bytes have been written and the checksum matches (so this is more standalone).

    https://www.ramielcreations.com/using-streams-in-wordpress-http-requests

    Regarding the debug option, the whole debug.log file in the relevant nscan… folder contains only this:

    1605872501~~1~~Processing step 1/14
    1605872501~~1~~Checking NinjaScanner files integrity
    1605872501~~8~~Using local cached version of checksums
    1605872501~~1~~Files integrity is O.K
    1605872501~~8~~Process executed in 0.01 seconds and used 155 MB of memory
    1605872504~~1~~Processing step 2/14
    1605872504~~1~~Building file's list
    1605872505~~1~~Total files found: 99.376
    1605872505~~1~~Building database posts and pages checksum
    1605872505~~1~~Found 0 posts and 51 pages in the database
    1605872505~~8~~Process executed in 0.78 seconds and used 155 MB of memory
    1605872510~~1~~Processing step 3/14
    1605872510~~1~~Skipping WordPress core files integrity check
    1605872510~~4~~Error: E_ERROR (Out of memory (allocated 162529280) (tried to allocate 14684160 bytes) - 523)
    1605872521~~1~~Cancelling scanning process
    1605872525~~1~~Cancelling scanning process
    1605872527~~1~~Cancelling scanning process
    1605872527~~1~~Cancelling scanning process

    For me it looks like there might be issues with not flushing things on every write (not sure if everything is appended line by line and after every fwrite an fflush is done which is recommended) or if everything is stored in one big array and then written at once (which could hit the 160 MB) and the memory is not released directly as the variables are not unset (maybe).

    In my own scripts I use generator functions (yield) with the iterators and directly pass through the needed values and directly append them to the relevant logfile one by one, flush after each write and then close the file handle.

    I guess this may lead to some (bigger) refactorings, like you mention.
    Let me know if it would be helpful if I send you my own scripts that I use to hunt for malicious files / code.

    • This reply was modified 5 months, 3 weeks ago by danielrufde.
    • This reply was modified 5 months, 3 weeks ago by danielrufde.
    Plugin Author nintechnet

    (@nintechnet)

    I was able to reproduce the issue by reducing the PHP memory limit to 96MB and uploading 200k files in a folder: v3.x crashed when building the files list, while v2.x crashed later on while comparing the two snapshots.
    The problem is the snapshot: we use a serialized array and that crashes the site when there’re too many files and not enough memory.
    You can see in the PHP error message the line number where the error occurred: 523

    522   // Save snapshot
    523   file_put_contents( NSCAN_TMP_SNAPSHOT, serialize( $snapshot ) );
    

    Using json_encode/json_decode could save a couple of megabytes, but that wouldn’t be enough.
    I tried to open and unserialize the 25MB snapshot files and checking the memory:

    echo "Start: ". memory_get_peak_usage( false ) . "\n";
    foo = file_get_contents( $snapshot );
    echo "Start: ". memory_get_peak_usage( false ) . "\n";
    $foo = unserialize( $foo );
    echo "Start: ". memory_get_peak_usage( false ) . "\n";
    

    The result:

    Start: 428208
    Start: 26234264
    Start: 153916088
    

    Using json_decode:

    Start: 429144
    Start: 23,805,648
    Start: 136455120
    

    I would need to find an alternative, or to split the snapshot into small parts.

    Thread Starter danielrufde

    (@danielrufde)

    Ah, good to know. I wasn’t aware that the number is the line number as this was not clear to me at first. Correct, I found this afterwards too after digging a bit deeper and checking all file and array related functions like file_get_contents and file_put_contents. As there is no xdebug on the customer server and I have no copy of all files (which would be way too much) I had to do a mnaual code review.

    I tested and evaluated some solutions (mgspack, fixed array sizes, array_chunk, …) and it’s probably better to stream the data to prevent such bottlenecks (not sure if JSON/CSV makes sense) using some streaming (append only) approach. The data could be written line per line using the aforementioned solutions and some custom data format (per entry – which reminds me of monolog for logging and other solutions). Ideally there should not be too much data be kept in the memory or be processed/written/read the same time.

    Everything else (bigger chunks of data, arrays with multiple elements and so on) might lead to issues in environments where resource constraints exist. I doubt that a single line / entry might hit more than a few MB memory consumption.

    At least my (very simplistic line-per-line reading) scripts with the generator syntax never hit any limits and with this approach we could probably remove some limits like the filesize limit for scanning files (for example in one case hackers used a renamed phar file to store and load files, they used png as file extension, which leads me to another thing that is currently not yet covered afaik – chameleon attacks).

    And by doing this in small steps instead in one huge step it is often much faster due to the lower memory and resource consumption and no swapping or other negative effects occur.

    Anyways, I hope the input is helpful to find a solution that scales well with every server / setup and we can scan also such big WordPress instances without such memory issues.

    PS: sorry for the lengthy reply, I’ve tried and tested several possible solutions and wanted to share my thoughts with you and recommend an efficient solution to solve such problems once and for all.

    • This reply was modified 5 months, 3 weeks ago by danielrufde.
    Plugin Author nintechnet

    (@nintechnet)

    And using relative paths instead of absolute paths would save several MB of memory: almost 20MB with 200k files. The scanner could recreate the absolute path by prepending ABSPATH to the filename when needed.

    Thread Starter danielrufde

    (@danielrufde)

    I’ve tested the wp_remote_get improvement and it makes some difference.

    Here are the results from my local ddev setup including a diff of my changes:

    --- lib/scan.php	2021-01-29 01:28:25.000000000 +0100
    +++ lib/scan.php	2021-01-29 01:29:53.000000000 +0100
    @@ -568,6 +568,8 @@
     		$res = wp_remote_get(
     			$wp_zip_url,
     			array(
    +				'stream' => true,
    +				'filename' => NSCAN_CACHEDIR ."/$wp_zip",
     				'timeout' => NSCAN_CURL_TIMEOUT,
     				'httpversion' => '1.1' ,
     				'user-agent' => 'Mozilla/5.0 (compatible; NinjaScanner/'.
    @@ -585,11 +587,10 @@
     			return false;
     		}
     
    -		if ( $res['response']['code'] == 200 ) {
    -			// Save the ZIP file:
    -			file_put_contents( NSCAN_CACHEDIR ."/$wp_zip", $res['body'] );
    -
    -		} else {
    +		if ( $res['response']['code'] != 200 ) {
    +			if (file_exists( NSCAN_CACHEDIR ."/$wp_zip" )) {
    +				unlink( NSCAN_CACHEDIR ."/$wp_zip" );
    +			}
     			nscan_log_warn( sprintf(
     				__('HTTP Error %s. Skipping this step, you may try again later', 'ninjascanner'),
     				(int)$res['response']['code']
    

    Without streaming:

    1611878659~~1~~Processing step 3/14
    1611878659~~1~~Checking WordPress core files integrity
    1611878659~~8~~Downloading wordpress-5.4.zip from wordpress.org
    1611878676~~8~~Building files list from ZIP archive
    1611878706~~8~~Using sha1 algo
    1611878729~~1~~Files integrity is O.K
    1611878729~~2~~Additional/suspicious files: 1
    1611878729~~8~~Process executed in 70.27 seconds and used 42.44 MB of memory (NinjaScanner additional memory: 37.02 MB).
    1611878730~~1~~Processing step 4/14

    With streaming:

    1611879072~~1~~Processing step 3/14
    1611879072~~1~~Checking WordPress core files integrity
    1611879072~~8~~Downloading wordpress-5.4.zip from wordpress.org
    1611879087~~8~~Building files list from ZIP archive
    1611879114~~8~~Using sha1 algo
    1611879137~~1~~Files integrity is O.K
    1611879137~~2~~Additional/suspicious files: 1
    1611879137~~8~~Process executed in 64.56 seconds and used 5.42 MB of memory (NinjaScanner additional memory: 0.00 MB).
    1611879137~~1~~Processing step 4/14
    • This reply was modified 3 months, 2 weeks ago by danielrufde.
    • This reply was modified 3 months, 2 weeks ago by danielrufde.
    • This reply was modified 3 months, 2 weeks ago by danielrufde.
    • This reply was modified 3 months, 2 weeks ago by danielrufde.
    Plugin Author nintechnet

    (@nintechnet)

    Thanks for submitting it.
    I quickly tested it:
    Before:

    29-Jan-21 17:26:11 DEBUG Downloading wordpress-5.6.zip from wordpress.org
    29-Jan-21 17:26:26 DEBUG Process executed in 15.09 seconds and used 52.77 MB of memory (NinjaScanner additional memory: 46.68 MB).
    

    After:

    29-Jan-21 17:28:35 DEBUG Downloading wordpress-5.6.zip from wordpress.org
    29-Jan-21 17:28:45 DEBUG Process executed in 10.01 seconds and used 6.09 MB of memory (NinjaScanner additional memory: 0.00 MB).
    

    I’m not really surprised by the difference, wp_remote_get() without streaming is not really suitable for big files.
    I will add it to the nscan_wp_repo_download() function too, because some plugin are quite big too.

    Plugin Author nintechnet

    (@nintechnet)

    Thread Starter danielrufde

    (@danielrufde)

    Nice, it seems this has reduced cases where the scan failed (at this step).

    I did a few more tests with a small WordPress instance.
    There might be a bigger difference on bigger websites but adding the S(tudy) flag to the regular expressions seems to make the file scan a bit faster.

    You can see my changes at https://gist.github.com/DanielRuf/b8106c23e99a16ebf6df1d12225d5464/revisions#diff-b1c2261743078fa6372874f3b3945356eeef0f42e77ff2d21e91df9526a850a1

    Without these changes:

    13-Feb-21 16:06:10 INFO  Processing step 13/14
    13-Feb-21 16:06:25 INFO  No suspicious file detected (5.536 files checked)
    13-Feb-21 16:06:25 DEBUG Process executed in 15.45 seconds and used 52,74 MB of memory (NinjaScanner additional memory: 3,28 MB).

    With these changes:

    13-Feb-21 16:14:22 INFO  Processing step 13/14
    13-Feb-21 16:14:37 INFO  No suspicious file detected (5.537 files checked)
    13-Feb-21 16:14:37 DEBUG Process executed in 15.19 seconds and used 52,74 MB of memory (NinjaScanner additional memory: 3,28 MB).

    Unfortunately I do not have access to other test instances to verify how much this impacts scans for big(ger) websites. Maybe you can test this with some more test instances and example cases (additional custom signatures for example).

    Plugin Author nintechnet

    (@nintechnet)

    Which version of PHP are you using?
    Since version 7.3, PHP uses PRCE2 (PCRE 10+) which, by default, should always perform studying.

    Thread Starter danielrufde

    (@danielrufde)

    According to the hosting settings and phpinfo(); the websites uses PHP 7.4.

    Plugin Author nintechnet

    (@nintechnet)

    phpinfo should also show the PCRE version (“PCRE Library Version”), which should be 10.36 or so.
    You would probably need to test the S flag with PHP 7.2 or below.

Viewing 13 replies - 1 through 13 (of 13 total)
  • You must be logged in to reply to this topic.