An easy-to-implement web scraper for WordPress. Display real-time data from any website directly in your posts, pages or sidebar.
For use within themes:
<?php echo wpws_get_content($url, $selector, $xpath, $wpwsopt)?> (supply either selector or xpath; the one you do not use can be left empty)
Example usage in theme:
<?php echo wpws_get_content('http://google.com','title','','user_agent=Bot+at+mysite.com&on_error=error_show&')?> (Displays the title tag of Google's home page, using 'Bot at mysite.com' as the user agent)
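The fourth argument in the example above is written as a URL-encoded query string. As a minimal sketch, and assuming the plugin accepts any string in that format, the string could be built from an array with PHP's http_build_query(). The helper name build_wpws_options is hypothetical and not part of the plugin.

```php
<?php
// Hypothetical helper (not part of WP Web Scraper): builds the option
// string that is passed as the fourth argument of wpws_get_content().
function build_wpws_options(array $options): string
{
    // http_build_query() URL-encodes each value, so spaces become '+'.
    // The separator is passed explicitly so the result does not depend
    // on the arg_separator.output ini setting.
    return http_build_query($options, '', '&');
}

$wpwsopt = build_wpws_options([
    'user_agent' => 'Bot at mysite.com',
    'on_error'   => 'error_show',
]);

echo $wpwsopt, PHP_EOL; // user_agent=Bot+at+mysite.com&on_error=error_show
```

In a theme, the resulting $wpwsopt would then be passed as the last argument of wpws_get_content().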
For use directly in posts, pages or sidebar (text widget):
[wpws url="" selector=""]
Example usage as a shortcode:
[wpws url="http://google.com" selector="title" user_agent="Bot at mysite.com" on_error="error_show"] (Displays the title tag of Google's home page, using 'Bot at mysite.com' as the user agent)
Other supported arguments (for theme tag / shortcode) are listed below. Only url and selector are required. All the rest are optional:
/[aeiou]/ will clear all single lowercase vowels from the output. This Regex reference will be helpful.
clear_regex but you can specify a CSS selector instead of regex.
replace_text before the scraper flushes its output. For example /[aeiou]/ will replace all single lowercase vowels in the output. This Regex reference will be helpful.
replace_regex but you can specify a CSS selector instead of regex.
basehref="http://yahoo.com" will convert all relative links to absolute by prepending http://yahoo.com to all href and src values. Note that basehref needs to be a complete path (with http) and must have no trailing slash.
on_error="screwed!" will output 'screwed!' if something goes wrong during the scrape. If omitted, the default value specified in the plugin settings will be used.
iconv charset conversion of scraped content. You should specify the charset of the source URL you are scraping from. If omitted, the default encoding of your blog will be used.
<a><p> to be stripped off. Only the text content within these tags will be displayed. This can be used to strip off all links etc. If omitted, no tags are stripped.
<a><p> to be removed. These tags and the content within them will be removed. If omitted, no tags are removed.
urldecode for URLs with special characters. Set to 0 if you do not want to use it. Default value is 1.
xpathdecode for XPath queries with special characters. Set to 0 if you do not want to use it. Default value is 0.
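To illustrate what the regex and tag-stripping arguments above describe, here is a rough sketch in plain PHP. It only mirrors the described effect using preg_replace() and strip_tags(). Whether the plugin uses these exact calls internally is an assumption, and the sample strings are made up.

```php
<?php
// Made-up sample of scraped markup.
$scraped = '<p>Latest <a href="/news">web scraper</a> news</p>';

// Clearing with /[aeiou]/ removes every single lowercase vowel.
$cleared = preg_replace('/[aeiou]/', '', 'web scraper');

// Replacing with the same pattern swaps each vowel for a replacement text.
$replaced = preg_replace('/[aeiou]/', '*', 'web scraper');

// Stripping tags keeps only the text content inside them.
$stripped = strip_tags($scraped);

echo $cleared, PHP_EOL;  // wb scrpr
echo $replaced, PHP_EOL; // w*b scr*p*r
echo $stripped, PHP_EOL; // Latest web scraper news
```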
This section details the usage of selectors, which are the heart of WP Web Scraper. For parsing HTML, the plugin uses phpQuery, so detailed documentation on selectors can be found at phpQuery - Selector Documentation.
Selectors are a standard way to query the DOM structure of the scraped HTML document. phpQuery uses CSS selectors (like jQuery), so those familiar with CSS selectors will feel at home. To get started, you can use elements, #ids and .classes to identify content. Here are a few examples:
<td> on the page with a class 'specialhead'.
<td> of the fourth <table> within the page.
<div> inside the first element with id 'header'.
Since version 2.3, you can also optionally use XPath to query your content. Details on the usage of XPath can be found in the PHP documentation. XPath can be handy when scraping non-standard HTML tags or when working with RSS / ATOM or generic XML feeds.
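As a small illustration of the kind of XPath query that suits an XML feed, the sketch below runs //item/title against a made-up RSS snippet using PHP's standard DOMDocument and DOMXPath classes. It is independent of the plugin and only shows what such an expression selects.

```php
<?php
// Made-up RSS snippet standing in for a scraped feed.
$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($feed);
$xpath = new DOMXPath($doc);

// Collect the text of every <title> that sits directly inside an <item>.
$titles = [];
foreach ($xpath->query('//item/title') as $node) {
    $titles[] = $node->textContent;
}

print_r($titles);
```

The channel-level title is not matched, because the expression only selects titles whose parent is an item element.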
At times you may need to create scraping pages on the fly that fetch content from a single underlying source by passing different GET (page) arguments to it. For this, you can use a built-in feature that converts specific text in the url or postargs of your scrap to its corresponding value, based on the GET arguments passed to that page.
For example, if you want a page to scrape symbols on reuters.com/finance dynamically based on user input, then:
This will replace ___symbol___ in the url with CSCO.O in real time. You can use multiple such replacement variables in your url or postargs. Each replacement variable should be wrapped between 3 underscores. Note that field names passed this way are case-sensitive: 'FieldName' and 'fieldname' are treated as different names.
You can also use the special variable ___QUERY_STRING___ to insert the complete query string that follows the ?
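The replacement behaviour described above can be sketched roughly as follows. This is not the plugin's own code: the function name replace_placeholders, the example.com URL and the parameter layout are all hypothetical, and the sketch only assumes that each ___name___ token is swapped for the GET argument of the same name.

```php
<?php
// Hypothetical sketch of the described behaviour, not the plugin's code.
// $get stands in for $_GET and $queryString for the raw query string.
function replace_placeholders(string $url, array $get, string $queryString = ''): string
{
    // The special variable is replaced by the complete query string.
    $url = str_replace('___QUERY_STRING___', $queryString, $url);

    // str_replace() is case-sensitive, so ___Symbol___ and ___symbol___
    // are treated as different variables.
    foreach ($get as $name => $value) {
        $url = str_replace('___' . $name . '___', urlencode((string) $value), $url);
    }

    return $url;
}

// Made-up target URL; 'symbol' plays the role of a GET argument.
$template = 'http://example.com/quote?symbol=___symbol___';

echo replace_placeholders($template, ['symbol' => 'CSCO.O']), PHP_EOL;
// http://example.com/quote?symbol=CSCO.O
```

A token with no matching GET argument, or one that differs only in letter case, is left unchanged in this sketch.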
Using the callback function, you can extend the plugin to do some advanced parsing. Simply put, it's a function which will parse and return your data. Your callback function can reside in functions.php of your theme. The function should take a single string parameter, parse it and return a string as output.
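A callback matching that description could look like the sketch below. The function name my_wpws_callback is hypothetical, and the clean-up it performs (dropping tags and collapsing whitespace) is only one example of the parsing you might do. How the function is registered with the plugin is not shown here.

```php
<?php
// Hypothetical callback for functions.php: takes the scraped content as
// a single string and returns a string, as the plugin expects.
function my_wpws_callback($content)
{
    // Drop any markup, keeping only the text content.
    $text = strip_tags($content);

    // Collapse runs of whitespace (including newlines) into single spaces.
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

echo my_wpws_callback("<h1>  Hello\n   World </h1>"), PHP_EOL; // Hello World
```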
Requires: 2.8 or higher
Compatible up to: 3.1.4
Last Updated: 2013-3-8