Clean HTML from MS Word. (26 posts)

  1. vavroom
    Posted 9 years ago #

    I hope someone can help me. I am at my wits' end. :(

    I am looking for a WYSIWYG editor that produces standards compliant code, particularly when copy/pasting from MSWord. I've just spent the last 7 hours looking for a solution, to no avail.

    Before you go on a tangent and rant against WYSIWYG (as so many people seem to do when someone is asking WYSIWYG questions), I ought to point out that there are situations where it is necessary to use them. Myself, I have been using Notepad to handcode html since before Lynx was pretty much the only web browser option (yes, i'm that old). But the site will be used as a CMS by people who are not HTML savvy and who have much better things to do than to learn html.

    I have downloaded and tried several plugins:
    - Chenpress
    - WYSI-Wordpress
    - Xinha4WP

    I had a look at WYSIWYG Pro.

    I had a play with X-Valid

    Obviously tried the built-in Tiny MCE.

    None of these allow you to cut/paste from MSWord and get clean HTML, free of extra classes and spans and formatting.

    The only editor that seems to handle that properly is XStandard. But I can't seem to find anyone who's done an XStandard integration into WordPress. I've spent quite a bit of time on their site, and I just don't see how to do such an integration.

    Surely, I'm not the only one that has such a need? It's not asking for much, just want to be able to cut/paste from Word into the editor and not have cr*p code show up.

    Has anyone found a solution?

    //Edit: Saving a word doc to text and then copy pasting from notepad into WP is not a desirable solution.

  2. Chris_K
    Posted 9 years ago #

    Pure Text? http://www.stevemiller.net/puretext/ If nothing else, it's a helluva lot quicker/easier than the Word/Notepad/Tiny MCE shuffle.

    Do you want ANY formatting to come through from Word? Or are you using Word purely for spelling/grammar checking only? If the former, it'll be a bit of a challenge to find a tool that will delete some but not all. I wrote about Pure Text a while back and got an interesting comment that may or may not interest you (http://www.solo-technology.com/blog/2006/07/09/pure-text/ and it's the only comment for the article).

  3. vavroom
    Posted 9 years ago #

    Thanks Solo,

    That PureText is one option that I'll keep as an alternative, but it involves my users downloading, installing and learning it, including using hot keys instead of mouse/menus. Dealing with unsophisticated users here (no judgement on them, just stating a fact that computers are a necessary evil in their lives; the less they have to learn about them, the better).

    Incorporating xstandard would be the optimal solution, but that certainly seems too complicated for my coding skills :(

    As for keeping or not some formatting, I don't know what the users will be doing for sure, but I suspect that they might wish to keep some of the formatting. Though that can be retrained. Hmmm.

  4. flakkito
    Posted 9 years ago #

    If I were you, I would have them copy/paste their old files into WordPad, then copy/paste into WP. A lot, if not all, of the formatting will be removed. Educate them on how to use WP's built-in toolbar, which is as simple as writing email. If they won't learn that, then that's a problem. They should be willing to try :)

  5. Kadmous
    Posted 9 years ago #

    Hi vav,

    Before I proceed with my post, I'd like to note that this is my first post on this forum. I am proud to power my site with WordPress and I am very thankful to this community for helping me tremendously with my WordPress world. Thanks everybody.

    I understand your problem. I also have some members who don't like to mess with Word, WYSIWYG and HTML. And I know there is no way around so far.

    But I approached this problem from a different angle. Because they don't like HTML, but I know HTML, I made my own quicktags buttons. Those buttons cover the functions I know my members need. In short, I made our own Word-quicktags-WYSIWYG editor.

    Knowing HTML as you said, just decide what functions you need. If you need larger text, make it a CSS class, put it in your template CSS file, then add it as a button to the quicktags.

    Talk to your members to find out what their needs are. You do the HTML editing once for everyone. And from time to time you improve it, delete some unused buttons, add new ones.

    Here is a guide by Tamba how to edit the quicktags: http://www.tamba2.org.uk/wordpress/quicktags/

  6. vavroom
    Posted 9 years ago #

    flakkito, yes, I can educate the user. I've been educating users for years. Frankly, I'm tired of *having* to educate users. Why not have a standards compliant wysiwyg editor that is easy to integrate? Oh, forgive the rant here, it's a sore topic for me. I have done way too much "educating" of way too many users. It's a fact of life, but the thing is, it really shouldn't be that hard to copy/paste and strip the *#&*^&@*$() added code from MS.

    @kadmous, the issue isn't so much styling once in WP, but the ability to prepare content in Word, offline, and then add the content to the site.

  7. yosemite
    Posted 9 years ago #

    but the thing is, it really shouldn't be that hard to copy/paste and strip the *#&*^&@*$() added code from MS.

    Lots of commercial software packages have made attempts at this, none of them seem that robust (they are only concerned with making MS Word work in their application).

    This is an MS issue. Take a look at IE or Outlook, where little concern is shown for accepted standards or compatibility. Word was never designed for HTML; its inclusion was an afterthought (approached, as usual, with minimal consideration for compatibility except with other MS products).

  8. vavroom
    Posted 9 years ago #

    Yosemite, I realise it's an Office issue, but xstandard does manage to strip the "uglies" from a simple copy/paste from Word.

    I guess bottom line in this case is "I wish I knew enough php/javascript to know how to integrate xstandard as the editor into WP". Thing is, there are a few people that discuss it here and there, no one seems to have done it.


  9. nutsmuggler
    Posted 9 years ago #

    I have the same problem, but I was thinking of tackling it from a different angle.
    I was thinking of developing an XSL stylesheet to convert an ODT (OpenOffice OpenDocument) file into clean HTML. It is NOT that difficult, given some knowledge of XML. You can open any Word file in OpenOffice, and then use a custom XSL filter to get valid, clean HTML output. It would be very nice to have a 'WordPress HTML' filter in OpenOffice.
    Yet, I have just started off, and I am a beginner. What do you think of this solution?

  10. Bryan Davis
    Posted 9 years ago #

    Admittedly it's a while since I used MS Word but can you not create a document and then save it as plain text (.txt)? All the user has to do that's extra is a "Save As..."; not too arduous even for the technophobic.
    Presumably this would purely give out the content without MS formatting and allow a safe copy/paste into the WP WYSIWYG for final formatting.

    Or maybe I'm being naive...

  11. vavroom
    Posted 9 years ago #

    @tptboy, the problem with save as text, is that you have to then open notepad, or editpad, or other text only editor to do the copy/paste. Extra steps that I may not have a problem with doing, but that I know other people are quickly going to find cumbersome. Opening the text file in MSWord leaves the silly MS formatting in when you copy/paste :(

    @nutsmuggler, the problem with your solution is, again, forcing someone to use a second or third solution. OO is good software, I had it running in a non-profit I was involved with. But it's beyond the scope of most web designers to be able to convince an entire organisation to switch from one office suite to another. Your solution might work well for you, sadly, i don't think it'll work for me in this case :(

  12. nutsmuggler
    Posted 9 years ago #

    Excuse me, you said you are looking for a WYSIWYG editor that produces standards compliant code. OpenOffice can be such an editor. You don't need to force your customers to switch to OO; you just get their Word documents, open them in OO and use the (alas, hypothetical) WordPress filter to produce clean, standards-compliant HTML for WordPress. OO is also free. The only drawback, at this stage, is that such a filter does not exist yet, but I'll keep you posted.

  13. Pizdin Dim
    Posted 9 years ago #

    "Presumably this would purely give out the content without MS formatting and allow a safe copy/paste into the WP WYSIWYG for final formatting."

    This is exactly the approach I always recommend to people. Combined with the excellent Markdown (as a WP plugin) my customers get a hassle-free and easy way of creating content. Markdown is so easy to learn too. People usually learn it in under 10 minutes and once they learn it, they don't forget.

  14. vavroom
    Posted 9 years ago #

    @nutsmuggler, sorry, I didn't make myself clear. I am looking for a standards compliant wysiwyg editor that works in WP, not a stand alone one. Like Tiny currently is the wysiwyg editor for WP, only it's not very good. Using a 3rd application is not what I want.

    @pizdin, yes, that's an approach that could be taken. But I go back to the fact that non-computer savvy people have better things to do than to learn markup. Any markup. To you and me, simple markup language like that is a breeze. But to a lot of people, it's a different story.

    The bottom line is, it should be possible to go straight from one wordprocessing application, be it Word, Wordperfect, OpenOffice, and copy/paste content into a CMS's editor window without losing formatting, and without having extraneous code added. This is not a rant against wordpress, btw.

    Which brings me back to: Has anyone managed to integrate XStandard as the editor in WordPress? Maybe I should start a thread titled that... :)

  15. AmbushCommander
    Posted 9 years ago #

    Perhaps you should look for the solution in a different place: instead of a standards-compliant WYSIWYG editor, try a standards-compliant HTML filter to integrate into WordPress directly. Not sure how well that filter will deal with Microsoft's proprietary tags though. And it doesn't have a WP plugin yet, although the API is so simple that I think doing that would be trivial.

    As for XStandard, this seems to be an application in and of itself (not Javascript), so it would require users to install something. Also, since it's client side, there's no guarantee that the input coming to you will be compliant. You really ought to look for something server-side. If the server can transparently clean up the code, it doesn't matter how bad or good the WYSIWYG editor is as long as it doesn't drop any tags.

    Sorry if this is resurrecting a dead topic.
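    The "server transparently cleans up the code" idea above can be sketched roughly as follows. This is a minimal illustration in JavaScript, assuming a server that can run it (WordPress itself is PHP, so in practice the same rules would be ported or a library used). The function name `stripWordMarkup` and the exact rules are assumptions made for this sketch, not something from XStandard or any plugin mentioned in this thread. It is a cosmetic cleanup for Word residue only and is not a security filter.

```javascript
// Hypothetical helper: the name and the rule set are assumptions for illustration.
// Cosmetic cleanup of Word residue only; this is NOT a security filter.
function stripWordMarkup(html) {
  // Matches class/style/lang attributes, quoted or unquoted
  const attrRe = /\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi;
  return html
    // Comments, including the conditional-comment blocks Word emits
    .replace(/<!--[\s\S]*?-->/g, '')
    // Office namespace tags such as <o:p> and </o:p> (their text content is kept)
    .replace(/<\/?[a-z]+:[^>]*>/gi, '')
    // Strip the attributes above, but only inside tags, never in body text
    .replace(/<[a-z][^>]*>/gi, (tag) => tag.replace(attrRe, ''))
    // Drop every span wrapper, keeping the text inside
    .replace(/<\/?span\b[^>]*>/gi, '')
    // Paragraphs that contain nothing but whitespace or &nbsp;
    .replace(/<p>(?:\s|&nbsp;)*<\/p>/gi, '');
}

const pasted =
  "<p class=MsoNormal style='margin:0'><span lang=EN-US>Hello <b>world</b><o:p></o:p></span></p>";
console.log(stripWordMarkup(pasted));
```

    Regular expressions are enough to show the idea, but they will miss some of Word's odder constructs; a real HTML parser or a dedicated filter is the sturdier choice.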

  16. likoma
    Posted 9 years ago #

    vavroom: I use a custom version of the Advanced WYSIWYG Editor plugin modified to add the "Paste from Word" and "Paste as Plain Text" buttons. I show it in action in this help video. I tell my clients to "use the 'W'" to do their pasting from Word and use the "clipboard 'T'" to paste from everything else. I'm to the point where I've started to remove the regular paste button (but of course they can still just paste it in with ctrl + v).

    Here are lines 34-39 of "advanced-wysiwg.php" plugin mentioned above:

    function extended_editor_mce_buttons_2($buttons) {
        // Second toolbar row; "pastetext" and "pasteword" are the added paste buttons
        return array(
            "cut", "copy", "paste", "pastetext", "pasteword", "undo", "redo", "separator",
            "table", "sub", "sup", "forecolor", "backcolor", "charmap", "separator",
            "code", "fullscreen", "wordpress", "wphelp", "cleanup" );
    }

    You do need to make sure to upload the "paste" folder from the TinyMCE install into the plugins directory of /wp-includes/js/tinymce/plugins/paste .

    If you've been through this, you know what I'm talking about. If you have no idea what I'm talking about, I can try to explain it better. There is more in-depth explanation in this thread. I like this option because it's just a small addition to the standard WP install and I don't have to reconfigure 47 things when I upgrade.

    Let us know how it goes.

    - Bradley

    Later: Just saw this plugin which seems to do the same thing but maybe easier to install?

  17. ronfaur
    Posted 9 years ago #

    Hi all, I loved reading this discussion, since I've been having the exact same problem vavroom posted here, and had no real solution to it for many years.
    I believe I finally found a good solution. It involves some client-side scripting, but it solves ALL my problems.
    The approach is quite simple: I'm using a DIV with contentEditable set to true, then I'm catching the onpaste event (which takes care of all pastes: edit menu, right click and ctrl+v) and ondrop (which takes care of dragging content into my rich text editor area).
    When pasting or dropping occurs, I go over the entire HTML in the editor area and simply remove all unneeded tags, styles and classes, resulting in clean HTML code.
    This does not require the user to do a thing, just a straight copy/paste from any rich HTML application (Word, Excel, another web page). I also disabled the undo feature (handling ctrl+z), making sure the user cannot undo the automatic HTML formatting:

    <html>
    <body>

    <div id="TEXTEDITOR1" style="overflow: auto; border: thin inset; font-weight: normal"></div>
    <div id="TEXTEDITOR1Message"></div>

    <script type="text/javascript">
    var TEXTEDITOR1 = document.getElementById('TEXTEDITOR1');
    var TEXTEDITOR1Message = document.getElementById('TEXTEDITOR1Message');

    function handleImportsTEXTEDITOR1() {
      // Clean a detached copy of the editor content, then write it back
      var newHTML = document.createElement('SPAN');
      newHTML.innerHTML = TEXTEDITOR1.innerHTML;
      var nNode = processNode(newHTML);
      if (nNode != newHTML)
        newHTML = nNode;

      TEXTEDITOR1.innerHTML = newHTML.innerHTML;
      TEXTEDITOR1Message.innerText = 'Note: Some of the content you just pasted has been modified to conform to our web standards';
      window.setTimeout(function () { TEXTEDITOR1Message.innerText = ''; }, 8000);
    }

    function processNode(obj) {
      // Clean the children first, swapping in any replacement nodes
      for (var i = 0; i < obj.childNodes.length; i++) {
        var child = obj.childNodes[i];
        var nObj = processNode(child);
        if (nObj != child)
          obj.replaceChild(nObj, child);
      }

      if (!validTag(obj)) {
        // Tag is not on the list: keep only its text, wrapped in a plain SPAN
        var newNode = document.createElement('SPAN');
        newNode.innerText = obj.innerText || '';
        return newNode;
      }
      if (obj.nodeType == 1) {
        obj.removeAttribute('class'); // Removing classes
        obj.removeAttribute('style'); // Removing styles
      }
      return obj;
    }

    function validTag(node) {
      // Cleanup function, all tags not listed here will be removed!!
      var nodeName = node.nodeName.toUpperCase();
      if (nodeName == '#TEXT') return true;
      else if (nodeName == 'BR') return true;
      else if (nodeName == 'FONT') return true;
      else if (nodeName == 'B') return true;
      else if (nodeName == 'STRONG') return true;
      else if (nodeName == 'SPAN') return true;
      else if (nodeName == 'I') return true;
      else if (nodeName == 'BLOCKQUOTE') return true;
      else if (nodeName == 'A') return true;
      else if (nodeName == 'DIV') return true;
      else if (nodeName == 'P') return true;
      else if (nodeName == 'OL') return true;
      else if (nodeName == 'UL') return true;
      else if (nodeName == 'LI') return true;
      return false;
    }

    function onPasteHandler(e) {
      // Let the browser insert the pasted content first, then clean it
      setTimeout(handleImportsTEXTEDITOR1, 1); // 1ms should be enough
    }

    function myKeyHandler(e) {
      if (e.keyCode == 90 && e.ctrlKey) { // CTRL+Z pressed - disabling
        e.preventDefault();
      }
    }

    TEXTEDITOR1.contentEditable = true;
    TEXTEDITOR1.innerHTML = '<FONT face=Arial size=2>Preload your HTML content here</FONT>';
    document.body.style.margin = '0px';
    TEXTEDITOR1.onpaste = onPasteHandler;
    TEXTEDITOR1.ondrop = onPasteHandler;
    // TEXTEDITOR1.onkeyup = handleChangeTEXTEDITOR1; // handler not included in this post
    document.addEventListener('keydown', myKeyHandler);
    </script>

    </body>
    </html>



    let me know if this works for you??
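    A variation on the approach above, as a minimal sketch rather than the poster's code: instead of letting the browser paste and cleaning up afterwards, the handler reads the clipboard's HTML, cleans it, and inserts only the cleaned markup. `clipboardData.getData` and `preventDefault` are standard browser APIs; `makePasteHandler`, `escapeText`, `cleanHtml` and `insertHtml` are hypothetical names introduced here, with the cleanup and insertion routines passed in by the caller.

```javascript
// makePasteHandler, escapeText, cleanHtml and insertHtml are hypothetical names.
function escapeText(text) {
  // Plain text must be escaped before it is inserted as markup
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function makePasteHandler(cleanHtml, insertHtml) {
  return function onPaste(event) {
    const data = event.clipboardData;
    if (!data) return;                      // no clipboard access: leave the default paste alone
    const html = data.getData('text/html');
    const text = data.getData('text/plain');
    if (!html && !text) return;             // nothing usable: leave the default paste alone
    event.preventDefault();                 // keep the raw markup out of the editor
    insertHtml(html ? cleanHtml(html) : escapeText(text));
  };
}

// Browser wiring, assuming an editable element with id "TEXTEDITOR1":
// const editor = document.getElementById('TEXTEDITOR1');
// editor.addEventListener('paste', makePasteHandler(
//   myCleanupFunction, // your own cleanup routine
//   (html) => document.execCommand('insertHTML', false, html)
// ));
```

    Passing the cleanup and insertion functions in keeps the handler independent of any particular editor, and means the uncleaned markup never enters the document in the first place.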

  18. AmbushCommander
    Posted 9 years ago #

    Client-side filtering is a bad idea for anything serious and should not be trusted. If you want to get rid of the MsoNormals Microsoft Word is so fond of, be my guest, but realize that anything done in JavaScript can (easily) be circumvented.

  19. ronfaur
    Posted 9 years ago #

    Client-side scripting, and specifically JavaScript has been around since before 1997 (when netscape 3 was introduced). Javascript and client side scripting have evolved (already 5-6 years ago) to the point that you can code a client side application (using a scripting language, and html only) that will rival the most sophisticated windows applications.

    I believe it is time to let go of the old notion "Client-side is bad idea for any serious coding" - it is a mature and reliable platform as well as the web browsers you are dealing with (IE/FFox), and it has been for a while.

    If the argument is about the ability to turn scripting off in your browser: true, but when it comes down to it, no site today will display properly if you turn scripting (or even cookies) off, plus the average user does not know how to turn scripting off, so it's really not an argument.
    If the argument is JavaScript errors, those can be easily caught and handled, and any programmer knows he needs to catch possible errors, specifically in the web environment.

    Regarding the solution I offered, it is the only solution I've seen that actually solves the problem (or even comes close), and it can be easily modified to your own company's needs, whether that is blocking specific HTML tags or removing unnecessary styling. This does the trick!!
    And all of it is a few lines of code: no need for downloads, no need to educate the user, it is seamless.

    Love client side :) dont hate ... (For no real reason that is)

  20. AmbushCommander
    Posted 9 years ago #

    You're dead on. Yes: JavaScript is here to stay, and you'd be a fool not to use it. Yes: JavaScript is a fully featured programming language: Mozilla Firefox is practically built on JavaScript. Yes: the average user does not turn off scripting.

    But there's one job that no amount of client-side scripting can take over: filtering incoming data. "Client-side is bad idea for any serious coding" does not equal "Do not trust data that comes from the client." The former is false, the latter true. Because JavaScript can be turned off, *any* security checks (for example, removing undesirable tags and attributes) can easily be circumvented.

    Let's give an example. First the normal use case:

    1. Bob wants to post a MSWord document. He copy pastes it into a text editor
    2. JavaScript (client side) transparently cleans up the formatting for him
    3. Bob presses submit, it gets sent to the server, which DOESN'T do any other checking, and puts it on the result page.

    How to abuse:

    1. Mallory surfs to the web page and turns off JavaScript. She fills in the web form with malicious, raw HTML
    2. Data gets sent to server, since the server doesn't do any checking, XSS and other meanies get onto the HTML page.

    JavaScript is great for thwarting good-faith incompetency/blundering, but against a determined attacker it is no good. You must implement server-side filtering with something like HTML Purifier.
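    To make the Bob/Mallory scenario concrete, here is a minimal sketch of a check that runs on the server no matter what the browser did. It is written in JavaScript purely for illustration (assuming a server that can run it) and is not HTML Purifier or a substitute for it. The names `serverFilter`, `escapeText` and `ALLOWED_TAGS`, and the allow-list itself, are assumptions made for this example. The filter is deliberately strict: allowed tags are re-emitted with every attribute dropped, and anything else that looks like markup is removed or escaped.

```javascript
// All names and the allow-list are assumptions for this sketch.
const ALLOWED_TAGS = new Set(['p', 'br', 'b', 'strong', 'i', 'blockquote', 'ol', 'ul', 'li']);

function escapeText(text) {
  // Neutralise any stray angle brackets left in text segments
  return text.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function serverFilter(html) {
  const tagRe = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
  let out = '';
  let last = 0;
  let m;
  while ((m = tagRe.exec(html)) !== null) {
    out += escapeText(html.slice(last, m.index));   // text before this tag
    const name = m[2].toLowerCase();
    if (ALLOWED_TAGS.has(name)) {
      out += '<' + m[1] + name + '>';               // re-emit the tag with no attributes
    }                                               // disallowed tags are dropped entirely
    last = tagRe.lastIndex;
  }
  return out + escapeText(html.slice(last));        // trailing text, including unclosed tags
}

// Bob's browser already cleaned his paste; the server filter leaves it alone.
console.log(serverFilter('<p>Hello <b>world</b></p>'));

// Mallory posted raw HTML with JavaScript turned off; the server still strips it.
console.log(serverFilter('<p onclick="steal()">Hi</p><script>steal()</script><img src=x onerror=steal()'));
```

    Because every attribute is dropped, links would lose their href here. Allowing attributes safely is exactly the hard part that a dedicated library exists to solve.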

    P.S. Theoretically speaking, websites should degrade gracefully: when JavaScript is turned off, they should still function, albeit without any of the client-side flashiness/polish. Alas, many websites fail at this, though most still manage it. Personally, I use NoScript to block scripting on all sites I visit, and then enable scripting on a case by case basis.

  21. vkaryl
    Posted 9 years ago #

    Are you involved with HTML Purifier?

  22. AmbushCommander
    Posted 9 years ago #

    Yes. :-) I'm quite proud of the library.

    Don't get me wrong: it's really frustrating seeing people constantly botching HTML filtering. It's a *hard* problem to solve. You can read this comparison for more info.

  23. vkaryl
    Posted 9 years ago #

    S'okay with me, I just figured you were; but I do think you should have so stated to begin with. I'm fully aware of the problems, and your library is one I've looked at myself for use - and had I not been in the middle of boot drive problems, would have downloaded before now.

  24. AmbushCommander
    Posted 9 years ago #

    Sorry about that, I'll be sure to make it clear in the future.

  25. Pizdin Dim
    Posted 9 years ago #

    "I've just spent the last 7 hours looking for a solution, to no avail."

    Don't despair, these things can take much more than seven hours.

    "Before you go on a tangent and rant against WYSIWYG (as so many people seem to do when someone is asking WYSIWYG questions), I ought to point out that there are situations where it is necessary to use them."

    I won't go on a tangent but have recommended the excellent Markdown Extra for quite a while now and most of my clients love it. While it takes a 15 minute investment to learn, it's very intuitive and the (simple) rules are hard to forget. I too have had some users who initially resisted switching from Word but they got used to Markdown very, very quickly and now won't go back. I strongly suggest you offer this as an alternative.

  26. gooseflight
    Posted 9 years ago #

    We have just launched version 1.0 of blog.dot. This template and associated DLL produces a clean HTML export from Microsoft Word. If you install the MySQL ODBC driver you can post directly to your WordPress blog from MS Word.

    Download version 1.0.

Topic Closed

This topic has been closed to new replies.
