WordPress.org

WP Web Scraper

An easy-to-implement web scraper for WordPress. Display real-time data from any website directly in your posts, pages or sidebar.

Usage Manual

For use within themes: <?php echo wpws_get_content($url, $selector, $xpath, $wpwsopt)?> (pass either a selector or an xpath; the one you do not use can be left empty).

Example usage in a theme: <?php echo wpws_get_content('http://google.com','title','','user_agent=Bot+at+mysite.com&on_error=error_show&')?> (displays the title tag of Google's home page, using "Bot at mysite.com" as the user agent).

For use directly in posts, pages or the sidebar (text widget): [wpws url="" selector=""]

Example usage as a shortcode: [wpws url="http://google.com" selector="title" user_agent="Bot at mysite.com" on_error="error_show"] (displays the title tag of Google's home page, using "Bot at mysite.com" as the user agent).

The supported arguments (for the theme tag / shortcode) are listed below. Only url and selector (or xpath in its place) are required; all the rest are optional:

  • url (Required): The complete URL which needs to be scraped.
  • selector (Required): The jQuery-style selector string that selects the content to be scraped. You can use elements, ids or classes for this. See the 'Selectors' section below for details on selector syntax.
  • xpath: Generic xpath query can be used as an alternate query method over selectors.
  • postargs: A string of POST arguments to send to the page you are trying to scrape. For example, id=197&cat=5
  • clear_regex: Regex pattern to be cleared before the scraper flushes its output. For example, /[aeiou]/ will clear every lowercase vowel from the output. A regex reference will be helpful here.
  • clear_selector: Similar to clear_regex but you can specify a CSS selector instead of regex.
  • replace_regex: Regex pattern to be replaced with the replace_with string before the scraper flushes its output. For example, /[aeiou]/ will replace every lowercase vowel in the output. A regex reference will be helpful here.
  • replace_selector: Similar to replace_regex but you can specify a CSS selector instead of regex.
  • replace_with: String which will replace the regex pattern specified in replace_regex.
  • replace_selector_with: String which will replace the selector specified in replace_selector.
  • basehref: A parameter which can be used to convert relative links in the scraped content to absolute links. For example, basehref="http://yahoo.com" will convert all relative links to absolute ones by prepending http://yahoo.com to all href and src values. Note that basehref must be a complete URL (including http://) with no trailing slash.
  • cache: Timeout interval of the cached data in minutes. If ignored, the default value specified in plugin settings will be used.
  • output: Format of the output rendered by the selector (text or html). The text format strips all html tags and returns only the text content. The html format returns the scraped content as is, with its html tags. If ignored, the default value 'text' will be used.
  • user_agent: The USERAGENT header for cURL or Fopen. This string acts as your footprint while scraping data. If ignored, the default value specified in plugin settings will be used.
  • timeout: Timeout interval for the cURL or Fopen function in seconds. Higher values are better for scraping slow servers, but they also increase your page load time. Ideally it should not exceed 2. If ignored, the default value specified in plugin settings will be used.
  • on_error: Error handling options for cURL or Fopen. Available options are error_show (to display the error), error_hide (to fail silently) or error_show_cache (to display data from expired cache if any). Setting it to any other string will output the string itself. For instance, on_error="screwed!" will output 'screwed!' if something goes wrong during the scrape. If ignored, the default value specified in plugin settings will be used.
  • htmldecode: Specify a charset for iconv charset conversion of scraped content. You should specify the charset of the source url you are scraping from. If ignored, the default encoding of your blog will be used.
  • striptags: Specify one or more tags in the format <a><p> to be stripped off. Only the text content within these tags will be displayed. This can be used to strip off all links etc. If ignored, no tags are stripped.
  • removetags: Specify one or more tags in the format <a><p> to be removed. These tags and content within them will be removed. If ignored, no tags are removed.
  • callback: Specify the name of a function which will parse the scraped content as desired. The raw scraped content is passed as an argument to the callback function, which should return the processed output. The function can also reside in functions.php of your theme.
  • debug: Set to 1 to turn on debug information in the form of an html comment in the scraped output, or set to 0 to turn it off. Default value is 1.
  • urldecode (only available in shortcode): Set to 1 to use urldecode for URLs with special characters. Set to 0 if you do not want to use it. Default value is 1.
  • xpathdecode (only available in shortcode): Set to 1 to use xpathdecode for xpath queries with special characters. Set to 0 if you do not want to use it. Default value is 0.
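
As a rough illustration of how several of the optional arguments above might be combined in a theme call, the sketch below builds the option string with PHP's http_build_query, which produces the same key=value&key=value form as the theme example earlier. The helper name my_wpws_options and the chosen values are hypothetical, and it is an assumption that the plugin accepts every optional argument in this query-string form.

```php
<?php
// Hypothetical helper: turn an array of optional arguments into the
// query-string form shown in the theme example (key=value&key=value).
function my_wpws_options(array $args): string
{
    // http_build_query URL-encodes values, so spaces become '+'.
    return http_build_query($args);
}

$options = my_wpws_options([
    'user_agent' => 'Bot at mysite.com',
    'on_error'   => 'error_hide',
    'output'     => 'text',
    'cache'      => 60,   // minutes
    'timeout'    => 2,    // seconds
]);

// Only call the plugin function if the plugin is actually loaded.
if (function_exists('wpws_get_content')) {
    echo wpws_get_content('http://google.com', 'title', '', $options);
}
```

Guarding the call with function_exists keeps the theme from breaking if the plugin is deactivated.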

Selectors

This section details the usage of selectors, which are the heart of WP Web Scraper. For parsing html, the plugin uses phpQuery, so detailed documentation on selectors can be found at phpQuery - Selector Documentation.

Selectors are a standard way to query the DOM structure of the scraped html document. phpQuery uses CSS selectors (like jQuery), so those familiar with CSS selectors will feel at home. To get started, you can use elements, #ids and .classes to identify content. Here are a few examples:

  • 'td.specialhead:eq(0)' will get you the content within the first <td> on the page with the class 'specialhead'.
  • 'table:eq(3) td:eq(3)' will get you content within the fourth <td> of the fourth <table> within the page.
  • '#header div:eq(1)' will get you content within the second <div> inside the first element with id 'header'.

Since version 2.3, you can also optionally use xpaths to query your content. Details on usage of xpath can be found in the PHP documentation. XPaths can be handy while trying to scrape non-standard html tags or while working with RSS / ATOM or generic XML feeds.
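
To give a feel for what an xpath query selects on an XML feed, here is a minimal sketch that runs one against a small inline RSS document. It uses PHP's own DOM extension directly rather than the plugin, and the feed content is made up for the example; presumably the same query string could be passed as the xpath argument.

```php
<?php
// Illustration only: this uses PHP's DOM extension directly, not the plugin.
$feed = <<<XML
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($feed);
$xpath = new DOMXPath($doc);

// '//item/title' matches every <title> that is a direct child of an <item>;
// the channel's own <title> is not matched.
$titles = [];
foreach ($xpath->query('//item/title') as $node) {
    $titles[] = $node->textContent;
}

echo implode(', ', $titles), "\n";
```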

Dynamic URLs and postargs

At times you may have to create scraping pages on the fly to fetch content from a single underlying source by passing different GET (page) arguments to it. For this, you may use a built-in feature which converts specific text in the url or postargs of your scrape to its corresponding value, based on the GET arguments passed to that page.

For example, suppose you want a page to scrape symbols on reuters.com/finance dynamically based on user input. If the url of your scrape contains ___symbol___ and the page is requested with the GET argument symbol=CSCO.O, this will replace ___symbol___ in the url with CSCO.O in real time. You can use multiple such replacement variables in your url or postargs. Each replacement variable should be wrapped between 3 underscores on either side. Note that field names passed this way are case-sensitive: 'FieldName' and 'fieldname' are different.

You can also use the special variable ___QUERY_STRING___, which is replaced with the complete query string (everything after the ?).
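
The substitution described in this section can be sketched roughly as follows. This is a hypothetical re-implementation for illustration, not the plugin's own code: the function name my_fill_placeholders and the example.com URL are assumptions, and values are inserted as given without any extra encoding.

```php
<?php
// Hypothetical re-implementation of the placeholder substitution,
// for illustration only; the plugin's own code may differ.
function my_fill_placeholders(string $template, array $getArgs, string $queryString = ''): string
{
    $map = ['___QUERY_STRING___' => $queryString];
    foreach ($getArgs as $name => $value) {
        // Names are matched exactly, so 'symbol' and 'Symbol' differ.
        $map['___' . $name . '___'] = (string) $value;
    }
    // strtr replaces each placeholder in a single pass.
    return strtr($template, $map);
}

// In WordPress these would come from $_GET and $_SERVER['QUERY_STRING'].
$url = my_fill_placeholders(
    'http://example.com/quote?symbol=___symbol___',
    ['symbol' => 'CSCO.O'],
    'symbol=CSCO.O'
);
echo $url, "\n"; // http://example.com/quote?symbol=CSCO.O
```

Because names are matched exactly, a template containing ___Symbol___ would be left untouched by a GET argument named symbol.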

Callback

Using the callback function, you can extend the plugin to do some advanced parsing. Simply put, it's a function which will parse and return your data. Your callback function can reside in functions.php of your theme. The function should take a single string parameter, parse it and return a string as output.
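
A minimal sketch of such a callback, as it might appear in functions.php, is shown below. The function name my_wpws_clean_title and the processing it performs are hypothetical; the only contract taken from the description above is one string in, one string out.

```php
<?php
// Hypothetical callback: receives the raw scrap as a string and
// returns the processed string.
function my_wpws_clean_title($scrap)
{
    // Collapse runs of whitespace and trim the ends.
    $clean = trim(preg_replace('/\s+/', ' ', $scrap));
    return strtoupper($clean);
}

echo my_wpws_clean_title("  Hello \n  world "), "\n"; // HELLO WORLD
```

It would then be named in the shortcode, for example: [wpws url="http://google.com" selector="title" callback="my_wpws_clean_title"]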

Requires: 2.8 or higher
Compatible up to: 3.1.4
Last Updated: 2013-3-8
Downloads: 33,317
