Split a file on any character in Python

I need to split a big text file on a certain character. I expect I am being thick about this, but split doesn’t quite do what I want because it includes the matching line, whereas I want to split right on the matching character.

My Python answer:

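def readlines(fileobj, endings, chunksize=4096):
    """Returns a generator that splits on lines in a file with the given
    line-ending. The file must be open in binary mode and endings must be
    bytes. Each line yielded includes its line-ending.
    """
    line = b''
    while True:
        buf = fileobj.read(chunksize)
        if not buf:
            # End of file: yield any data left over after the last ending
            if line:
                yield line
            break

        line = line + buf

        while endings in line:
            idx = line.index(endings) + len(endings)
            yield line[:idx]
            line = line[idx:]

if __name__ == "__main__":
    import sys, os

    FORMFEED = b'\x0c' # ASCII 12
    basename = os.path.basename(sys.argv[1])
    # Binary mode, so Python 3 does no decoding or newline translation
    with open(sys.argv[1], 'rb') as infile:
        for num, data in enumerate(readlines(infile, endings=FORMFEED)):
            filename = basename + '-' + str(num)
            with open(filename, 'wb') as outfile:
                outfile.write(data)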

This is also useful when reading data exported from some old-fashioned Mac application like FileMaker 5, where the line-endings are ASCII 13 (carriage return), not ASCII 10 (line feed).
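For the carriage-return case, here is a minimal sketch of the same chunked approach run against in-memory data rather than a real export. The name split_on and the sample records are made up for illustration, and it assumes the data is read as bytes so that Python 3 does not translate the line-endings first.

```python
import io

def split_on(fileobj, ending, chunksize=4096):
    """Yield pieces of a binary file object, each ending with `ending`."""
    pending = b''
    while True:
        buf = fileobj.read(chunksize)
        if not buf:
            if pending:
                yield pending  # trailing data with no final ending
            return
        pending += buf
        while ending in pending:
            idx = pending.index(ending) + len(ending)
            yield pending[:idx]
            pending = pending[idx:]

# Hypothetical sample data: three records separated by CR (ASCII 13)
export = io.BytesIO(b'alpha\rbeta\rgamma')

# A tiny chunksize forces records to straddle several reads
records = [rec.rstrip(b'\r') for rec in split_on(export, b'\r', chunksize=4)]
print(records)  # [b'alpha', b'beta', b'gamma']
```

The small chunksize is only there to show that a record split across two reads still comes out whole; for real files the default should be fine.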

This post was inspired by Lotus Notes version 8.5, which is so advanced that to save a message in a file on disk you have to export it as structured text. And if you want to save a whole bunch of messages as individual files, you must forget that drag-and-drop was introduced with System 7; that would be too obvious.

  1. September 1st, 2010 at 18:12 | #1

    Haha! Quite a surprise to find that the first search result on splitting files on a character wanted to split Lotus Notes structured text files, just like me! I like your solution; very elegant, very pythonic.

  2. September 1st, 2010 at 18:51 | #2

    The complete script parsed the message headers to create filenames containing the date and subject. E-mail me if you want it, david@gasmark6.com

  3. September 2nd, 2010 at 13:23 | #3

    Thanks for the offer, but my immediate problem is already solved. Just needed to diff some documents, and couldn’t be bothered to set up the “delta of 2 documents” menu hack. :-)

  4. September 3rd, 2010 at 21:44 | #4

    Rather chuffed that anyone would describe my efforts as Pythonic! What is the “delta of 2 documents menu hack”? I have a morbid fascination with Lotus Notes now that I no longer use it in anger.

  5. September 14th, 2010 at 06:20 | #5

    Hehe, so now I have learned my New English Word Of The Day, thanks to you and the online Merriam-Webster. http://www.merriam-webster.com/dictionary/chuffed

    There is built-in functionality in Lotus Notes to compare two documents; it’s in one of the included DLLs. However, it is not accessible from the UI unless you add a line in the .ini file defining a menu item to call the thing. Very weird, very typical Lotus Notes.

  6. Michael
    July 17th, 2011 at 22:45 | #6

    thanks! Also needed a script to extract my mail messages from Lotus…
