Building your own corpus – cleaning up your text file(s)

One of the issues with crawling web pages is that irrelevant text gets collected along with the content you want. This short post describes one way to clean up your text file(s). [Update 1: if you use ICEWeb instead of TextSTAT you won’t need to clean up as much, if at all. Update 2: BootCat also does not require much cleaning.] I assume you are using LibreOffice; I think one can find similar functions or extensions for Microsoft Office.

The extension you need is Alternative dialog Find & Replace for Writer. Load the extension by going to Tools>Extension Manager>Add. Then restart LibreOffice. You will then see the extension under Tools>Add-Ons>Alternative searching.

You need to look at your text and see if you can spot patterns in the parts you do not need. In my case I had to do five passes on the text. The first pass looked like this:


I get the [::BigBlock::] function by selecting Extended>Series of paragraphs (limited between start and end marks). This means See Also[::BigBlock::]File Under will find all text starting from See Also and ending with File Under. Note that you need to have the Regular expressions checkbox ticked. You can cycle through the selections by pressing Find to verify that each one is text you want to remove. Once you have done that, you can simply press the Replace All button.
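If you would rather script this step than use the dialog, the same idea can be sketched in Python. This is only a rough equivalent, not what the extension does internally: the function name remove_blocks and the sample text are made up for illustration, and it assumes the start and end marks are plain strings and that both marks should be deleted along with the text between them.

```python
import re

def remove_blocks(text, start_mark, end_mark):
    """Delete every span running from start_mark through end_mark, inclusive."""
    # re.escape treats the marks as literal text; re.DOTALL lets "." cross line breaks;
    # the non-greedy ".*?" stops at the first end mark after each start mark.
    pattern = re.compile(re.escape(start_mark) + r".*?" + re.escape(end_mark), re.DOTALL)
    return pattern.sub("", text)

# Made-up sample standing in for a crawled page.
sample = "Intro paragraph.\nSee Also\nLink one\nLink two\nFile Under\nNext article."
cleaned = remove_blocks(sample, "See Also", "File Under")
print(cleaned)
```

As with cycling through the matches using Find, it is worth printing what the pattern matches on a small sample before running it over the whole file.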

My next pass looked like this:


My third pass was this:


Note the use of | as the OR operator.
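As a small illustration of the OR operator outside LibreOffice, here is how the same kind of alternation looks in Python. The two phrases are hypothetical examples of page furniture, not the terms I actually searched for.

```python
import re

# Hypothetical unwanted phrases; "|" means "match either alternative".
UNWANTED = re.compile(r"Share this|Leave a comment")

def remove_unwanted(text):
    """Delete every occurrence of any of the alternatives."""
    return UNWANTED.sub("", text)

sample = "Share this\nFirst sentence.\nLeave a comment\n"
result = remove_unwanted(sample)
print(repr(result))
```

Only the matched phrases are deleted, so the now empty lines remain and may need a further pass.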

My fourth pass removed any odd characters e.g.:
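A scripted version of this kind of pass might look like the Python sketch below. Which characters count as odd is an assumption here: the sketch treats anything outside printable ASCII, newline and tab as odd. That is crude, since it would also strip accented letters and curly quotes, so adjust the character class to suit your own text.

```python
import re

# Assumption: "odd" = anything that is not printable ASCII, a newline or a tab.
ODD_CHARS = re.compile(r"[^\x20-\x7E\n\t]")

def strip_odd_characters(text):
    """Delete every character matched by ODD_CHARS."""
    return ODD_CHARS.sub("", text)

# Made-up sample with a guillemet and a bullet left over from page navigation.
sample = "Menu \u00bb Home \u2022 About"
stripped = strip_odd_characters(sample)
print(stripped)
```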


My fifth pass removed photo and image attributions, i.e. text like “photo by” or “image by”.
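For anyone scripting the attribution pass, a minimal Python sketch could look like this. It assumes an attribution starts with "photo by" or "image by" (in any letter case) and runs to the end of its line; real pages will vary, so check the matches first. The name remove_attributions and the sample text are invented for the example.

```python
import re

# Assumption: the attribution runs from "photo by"/"image by" to the end of the line.
ATTRIBUTION = re.compile(r"(?:photo|image)\s+by\b[^\n]*", re.IGNORECASE)

def remove_attributions(text):
    """Delete attribution text, leaving the line break in place."""
    return ATTRIBUTION.sub("", text)

sample = "A sunny day in Paris.\nPhoto by Jane Doe\nThe market was busy."
no_credits = remove_attributions(sample)
print(no_credits)
```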

Of course, depending on how thoroughly you want to clean your text, one or two passes may be enough.

Stay tuned for related posts.

