WXR File Splitter

  • RangerPretzel

    (@rangerpretzel)


    Have a gigantic WXR file and want it split into smaller parts? Download this free application. It requires a Windows PC and the .NET 2.0 framework (which most people have installed, I think.)

    If you run Windows XP and don’t have .NET 2.0, you can download .NET 2.0 for free from Microsoft’s website or through Windows Update. (Vista and Win7 users already have .NET installed)

    http://www.rangerpretzel.com/content/view/20/1/

    Why did I write this program? A friend of mine needed help moving a 32MB WXR file from one WordPress server to another. The file had 2700+ entries, and her new server was choking on it. Instead of splitting it “by hand” into 100-item chunks (which would have taken forever), I wrote this program to do the heavy lifting for us.

    Check it out and let me know how you like it.

Viewing 12 replies - 1 through 12 (of 12 total)
  • Moderator Jan Dembowski

    (@jdembowski)

    Forum Moderator and Brute Squad

    That sounds like a useful tool. Do you have the source code posted somewhere?

    Don’t worry, I’m not asking to start any GPL/OSS fight, I just want to know how you broke up the WXR file into usable chunks without messing up the XML.

    Thread Starter RangerPretzel

    (@rangerpretzel)

    I have not posted the source code, mostly because I wrote the bulk of the program in the space of an hour or two. As such, I didn’t spend much time organizing it. I would want to clean/tighten up the code before releasing it.

    As for how it works, it’s quite simple.

    I decided that I would create a class that had a “header”, a List of “Items” and a “footer”. The code opens the file stream and reads forward until it hits the first <ITEM> tag. Everything read up to that point becomes the “header”.

    After that, it grabs every <ITEM> and stuffs each one into a list/array. Then anything beyond the last ITEM becomes the “footer”.

    Once all of that is loaded into memory, it’s easy to create new WXR files. Just write out, say, the header, items 1 through 100, and the footer. Repeat for items 101 through 200, and so on.
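
    The header / items / footer approach described above can be sketched in a few lines of Python. This is not the source of the Windows program (which was never posted), only a minimal illustration. It assumes every entry is wrapped in a plain <item>...</item> pair and that no entry contains a literal </item> inside its content; the function name split_wxr and the items_per_file parameter are made up for this example.

```python
import re

def split_wxr(text, items_per_file=100):
    """Split WXR text into chunks that all share the same header and footer.

    Hypothetical helper: assumes each entry is wrapped in <item>...</item>
    (tag case is ignored) and that the whole file fits in memory.
    """
    item_re = re.compile(r"<item>.*?</item>", re.DOTALL | re.IGNORECASE)
    matches = list(item_re.finditer(text))
    if not matches:
        # Nothing to split: return the input unchanged as a single chunk.
        return [text]

    header = text[:matches[0].start()]   # everything before the first item
    footer = text[matches[-1].end():]    # everything after the last item
    items = [m.group(0) for m in matches]

    chunks = []
    for start in range(0, len(items), items_per_file):
        body = "\n".join(items[start:start + items_per_file])
        chunks.append(header + body + footer)
    return chunks
```

    Each returned string can be written to its own file and imported separately. Holding the whole file in memory is fine for an export of a few dozen megabytes, which is the case described in this thread.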

    Unfortunately this won’t work on a Mac or Linux platform. Does anyone know of a program for OS X that does the same thing?

    @rangerpretzel: that program kicks serious butt. y’oughta tag that with “keeper”. much thanks!

    Here is some python code I wrote (so it should work just fine on a mac or linux):

    http://wordpress.pastebin.ca/2004312

    Just paste it into a text file (e.g. ‘splitter.py’). Go to the directory containing the newly created file. Make sure the file is executable. Then call

    ./splitter.py <name_of_your_wxr_file> <desired_number_of_slices>

    In truth this code hasn’t been extensively tested, but I’ve used it a few times now on various wxr files and it seems to work (it’ll output a bunch of separate wxr files that you can then import separately)
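
    Since the script is lightly tested, one way to gain some confidence before importing is to check that each output file still parses as XML and to count its items. The sketch below uses Python's standard xml.etree.ElementTree module; count_wxr_items is a hypothetical helper name, and the check assumes the usual rss > channel > item layout of an export file.

```python
import sys
import xml.etree.ElementTree as ET

def count_wxr_items(data):
    """Parse WXR bytes and return the number of <item> elements.

    Raises xml.etree.ElementTree.ParseError if the data is not well-formed.
    Assumes the items sit directly under rss/channel.
    """
    root = ET.fromstring(data)
    return len(root.findall("channel/item"))

if __name__ == "__main__":
    # Check every file named on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print("%s: %d items" % (path, count_wxr_items(handle.read())))
```

    Saved as, say, check.py, it could be run over the generated files with python check.py followed by their names. A parse error for any chunk suggests the split cut through the XML.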

    Just realized the link I posted above expires in a month. Here’s the actual Python script:

    #!/usr/bin/python
    
    # This script is designed to take a WordPress XML export file and split it into some
    # number of chunks (2 by default). The number of lines per chunk is determined by counting
    # the number of occurrences of a particular line, '<item>\n' by default, and breaking up the
    # file such that each chunk has an equal number of occurrences of that line. The appropriate
    # header and footer are added to each chunk.
    # The single-argument print() calls below run on both Python 2 and Python 3.
    
    import os
    import sys
    import math
    
    if len(sys.argv) < 2:
    	print('Please specify the name of the WordPress export file you would like to split')
    	sys.exit(1)
    
    try:
    	input_file = open(sys.argv[1], 'r')
    	lines = input_file.readlines()
    	input_file.close()
    	(input_file_path, input_file_string) = os.path.split(sys.argv[1])
    	(input_file_name, input_file_extension) = os.path.splitext(input_file_string)
    except IOError:
    	print('Could not open file "%s".' % sys.argv[1])
    	sys.exit(1)
    
    # Use at least 2 chunks; default to 2 when no count is given.
    number_of_chunks = max(int(sys.argv[2]), 2) if len(sys.argv) > 2 else 2
    line_delimiter = '<item>\n'
    
    # First pass: count the items.
    delimiter_count = 0
    for line in lines:
    	if line == line_delimiter:
    		delimiter_count += 1
    
    print('')
    print('File "%s" contains %s items' % (input_file_string, delimiter_count))
    
    delimiter_count = 1.0*delimiter_count
    delimiters_per_chunk = int(math.ceil(delimiter_count/number_of_chunks))
    
    print('Creating %s files with at most %s items each:' % (number_of_chunks, delimiters_per_chunk))
    
    header = ""
    footer = "\n</channel>\n</rss>\n"
    chunk_number = 1
    output_file_name = "%s_%s%s" % (input_file_name, chunk_number, input_file_extension)
    output_file = open(output_file_name, 'w')
    print('   Writing chunk %s to file %s...' % (chunk_number, output_file_name))
    
    # Second pass: copy lines, starting a new file whenever a chunk is full.
    delimiter_count = 0
    for line in lines:
    	if line == line_delimiter: delimiter_count += 1
    
    	# Everything before the first item is the header shared by all chunks.
    	if chunk_number == 1 and delimiter_count == 0: header += line
    
    	if delimiter_count > delimiters_per_chunk:
    		output_file.write(footer)
    		output_file.close()
    		chunk_number += 1
    		delimiter_count = 1
    
    		output_file_name = "%s_%s%s" % (input_file_name, chunk_number, input_file_extension)
    		output_file = open(output_file_name, 'w')
    		print('   Writing chunk %s to file %s...' % (chunk_number, output_file_name))
    		output_file.write(header)
    
    	output_file.write(line)
    
    output_file.close()
    print('Done!\n')

    Someone should set that up in Platypus, the script wrapper for OS X, so it’s really easy for people to use under OS X.

    Regarding the Python splitter: for me it doesn’t work in 3.0.3 (it reports 0 items).
    This is a corrected version (for me, it works):

    #!/usr/bin/python
    
    # This script is designed to take a WordPress XML export file and split it into some
    # number of chunks (2 by default). The number of lines per chunk is determined by counting
    # the number of occurrences of a particular line, '\t\t<item>\n' here, and breaking up the
    # file such that each chunk has an equal number of occurrences of that line. The appropriate
    # header and footer are added to each chunk.
    # The single-argument print() calls below run on both Python 2 and Python 3.
    
    import os
    import sys
    import math
    
    if len(sys.argv) < 2:
    	print('Please specify the name of the WordPress export file you would like to split')
    	sys.exit(1)
    
    try:
    	input_file = open(sys.argv[1], 'r')
    	lines = input_file.readlines()
    	input_file.close()
    	(input_file_path, input_file_string) = os.path.split(sys.argv[1])
    	(input_file_name, input_file_extension) = os.path.splitext(input_file_string)
    except IOError:
    	print('Could not open file "%s".' % sys.argv[1])
    	sys.exit(1)
    
    # Use at least 2 chunks; default to 2 when no count is given.
    number_of_chunks = max(int(sys.argv[2]), 2) if len(sys.argv) > 2 else 2
    line_delimiter = '\t\t<item>\n'
    
    # First pass: count the items.
    delimiter_count = 0
    for line in lines:
    	if line == line_delimiter:
    		delimiter_count += 1
    
    print('')
    print('File "%s" contains %s items' % (input_file_string, delimiter_count))
    
    delimiter_count = 1.0*delimiter_count
    delimiters_per_chunk = int(math.ceil(delimiter_count/number_of_chunks))
    
    print('Creating %s files with at most %s items each:' % (number_of_chunks, delimiters_per_chunk))
    
    header = ""
    footer = "\n</channel>\n</rss>\n"
    chunk_number = 1
    output_file_name = "%s_%s%s" % (input_file_name, chunk_number, input_file_extension)
    output_file = open(output_file_name, 'w')
    print('   Writing chunk %s to file %s...' % (chunk_number, output_file_name))
    
    # Second pass: copy lines, starting a new file whenever a chunk is full.
    delimiter_count = 0
    for line in lines:
    	if line == line_delimiter: delimiter_count += 1
    
    	# Everything before the first item is the header shared by all chunks.
    	if chunk_number == 1 and delimiter_count == 0: header += line
    
    	if delimiter_count > delimiters_per_chunk:
    		output_file.write(footer)
    		output_file.close()
    		chunk_number += 1
    		delimiter_count = 1
    
    		output_file_name = "%s_%s%s" % (input_file_name, chunk_number, input_file_extension)
    		output_file = open(output_file_name, 'w')
    		print('   Writing chunk %s to file %s...' % (chunk_number, output_file_name))
    		output_file.write(header)
    
    	output_file.write(line)
    
    output_file.close()
    print('Done!\n')

    I’ve just added a \t\t to the line_delimiter variable, so it matches an <item> line that is indented with two tabs.
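
    Because the indentation in front of <item> evidently differs between export files, a comparison that ignores surrounding whitespace may be more robust than hard-coding the tabs. The following is only a sketch of that idea, assuming <item> always sits alone on its own line; count_items is a hypothetical helper, not part of either script above.

```python
def count_items(lines, tag="<item>"):
    """Count lines that consist only of the tag, ignoring surrounding whitespace.

    Hypothetical helper: strip() removes leading tabs or spaces and the
    trailing newline, so '<item>\n' and '\t\t<item>\n' both match.
    """
    return sum(1 for line in lines if line.strip() == tag)
```

    In the scripts above, the same idea would mean testing line.strip() == '<item>' wherever a line is compared with line_delimiter.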

    Can someone write instructions for using this on OS X?

    It sounds great!

    1. Paste the script into a text file and save it as “splitter.py”.

    2. In a Terminal, run:
    python splitter.py <name_of_your_wxr_file> <desired_number_of_slices>

    Make sure the path to name_of_your_wxr_file is right. If you are not sure, just put the WXR file in the same directory as splitter.py.

    It should work.

    These instructions are the same for Linux.

    Or use Platypus (by Sveinbjorn Thordarson) as a wrapper for the Python script.

    I read the earlier suggestion about Platypus, but I’m in the target audience for *using* a generated app, not for *creating* one 😉

    It may seem obvious, but all I needed was the “In a Terminal …” Now I get it.

    Thank you both!

  • The topic ‘WXR File Splitter’ is closed to new replies.