Word wrap: more adventures with sed
October 15, 2008 – 9:15 pm
When I first pasted the transcript of C. Scott’s talk, it was ugly and unwrapped and splurted off the right edge of my screen. So I decided to learn more regular expressions. This is what happened.
Attempt #1: Mrah. why does cat foo.txt | sed 's/(.{1,80}( +|$\n?)|(.{1,80})/\1\3\n/g' not work? \3 is not an invalid reference, you silly thing.
D’oh moment #1: Oh. escaping { and ( with \ is needed with sed. now…
Attempt #2: why doesn’t cat foo.txt | sed 's/\(.\{1,80\}\)\( +|$\n?\)|\(.\{1,80\}\)/\1\3\n/g' work?
Attempt #3 (gave up): Ok, simplifying… cat foo.txt | sed 's/\(.\{1,80\}\)/\1\n/g' will work, but not split across word boundaries when possible. Good enough for me (for now).
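    #!/bin/sh
    # wrap: break each line of a file into chunks of at most 80 characters

    if [ $# -ne 1 ]; then
        echo "Usage: wrap file.txt"
        exit 127
    fi

    # \n in the replacement is understood by GNU sed; quoting $1 keeps
    # file names with spaces in them in one piece
    cat "$1" | sed 's/\(.\{1,80\}\)/\1\n/g'
    exit 0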
Translation from bahasa geek, for parents and other noncoders: I was trying to come up with a way to quickly wrap long lines of text (the equivalent of hitting Enter to put a newline in the middle of long sentences so that they’ll fit onto a page).
Sed is a stream editor that I used to do this, and I wrote a shell script (#!/bin/sh) that fed (cat) out the file I gave the script ($1 – a variable that gets replaced with the name of the file I want to wrap when the script is actually run). The sed command replaces (s/) every group of up to 80 characters, \(.\{1,80\}\) – the escaped parentheses, \( and \), say this is a group of characters, the period says “any character, spaces included,” and the “1,80” inside the escaped braces says “at least 1 and at most 80 of them in a row.”
It replaces each group with that same group (using \1, which is a backreference – basically, “the first group that you talked about here,” which is what we were doing with the parentheses) and then a newline, \n, which is like the Enter key. Basically, this script has the same effect as me going through a text document the following way:
- hit the right arrow key 80 times
- hit Enter
- repeat steps 1 and 2 until you reach the end of the document
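To see the effect on something shorter than a page, here is a small sketch of the same substitution with the width cut from 80 down to 5. The function name chunk5 is made up for this example, and the \n in the replacement assumes GNU sed.

```shell
#!/bin/sh
# Same substitution as the wrap script, but with a width of 5 instead of 80
# so the effect shows up on a short line.
# Assumption: GNU sed, which understands \n in the replacement text.
chunk5() {
    sed 's/\(.\{1,5\}\)/\1\n/g'
}

printf 'hello world\n' | chunk5
```

With GNU sed this should print hello, then " worl" (the space counts as a character like any other), then d, followed by an empty line, because the last chunk gets a newline of its own on top of the one the line already had.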
It’s a dumb script because it’ll break in the middle of words but I couldn’t figure out how to make it pay attention to word boundaries. I tried – see above attempts – but gave up because I didn’t have time to chase the answer down. I’m hoping that somebody reading this might be able to spot where I went wrong. Halp?
6 Responses to “Word wrap: more adventures with sed”
If you want it to only break at word boundaries, wouldn’t putting a space at the end of the first part of your match expression do that?
sed 's/\(.\{1,80\}\) /\1\n/g' foo.txt
By L33tminion on Oct 16, 2008
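A quick way to try this suggestion is with a narrower width. The sketch below assumes GNU sed and uses a made-up function name, wrap_at_space, with the width set to 10 rather than 80.

```shell
#!/bin/sh
# The suggested pattern with width 10: up to 10 characters followed by a
# space, where the space is replaced by a newline.
# Assumption: GNU sed, which understands \n in the replacement text.
wrap_at_space() {
    sed 's/\(.\{1,10\}\) /\1\n/g'
}

printf 'the quick brown fox\n' | wrap_at_space
```

It does break only at spaces. Since the pattern insists on a trailing space, though, the tail "brown fox" is also split at its last space even though it would fit in 10 characters, and a word longer than the width is left unbroken.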
why doesn’t cat foo.txt | sed 's/\(.\{1,80\}\)\( +|$\n?\)|\(.\{1,80\}\)/\1\3\n/g' work?
Apparently, both + and ? must be escaped as \+ and \?.
By L33tminion on Oct 16, 2008
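Putting that hint to work, here is a cut-down sketch of Attempt #2 with the escapes in place. It assumes GNU sed, where \+ and \| are extensions to basic regular expressions (so the | needs a backslash as well); the function name wrap10 and the width of 10 are only for illustration.

```shell
#!/bin/sh
# First expression: up to 10 characters, followed by either a run of spaces
# or the end of the line; the spaces are replaced by a newline.
# Second expression: drop the extra newline added after the last chunk.
# Assumption: GNU sed (\+, \| and \n are GNU extensions here).
wrap10() {
    sed -e 's/\(.\{1,10\}\)\( \+\|$\)/\1\n/g' -e 's/\n$//'
}

printf 'the quick brown fox\n' | wrap10
```

This keeps "brown fox" together on the second line, since the end of the line now counts as a valid place to stop. A word longer than the width is still left unbroken; the third alternative in Attempt #2 was presumably meant to cover that case.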
And just for kicks, here’s a Python recipe for the same thing.
By L33tminion on Oct 16, 2008
I think `fmt` will do what you want, much simpler.
fmt --help
By skierpage on Oct 18, 2008
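For comparison, a minimal sketch of the fmt approach. It assumes an fmt that accepts -w for the maximum line width, which both GNU coreutils and BSD fmt appear to support; the function name wrap_fmt, the width of 20, and the sample text are made up for the example.

```shell
#!/bin/sh
# Re-flow text from stdin so that no line is wider than 20 characters.
# fmt breaks only between words and may also join short lines together.
# Assumption: an fmt that supports the -w option.
wrap_fmt() {
    fmt -w 20
}

printf 'the quick brown fox jumps over the lazy dog\n' | wrap_fmt
```

For a whole file at the usual width, something like fmt -w 80 foo.txt should do the same job.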
I use vim for regular expressions. Despite introducing its own quirks it highlights matches, undo is easy, you can edit previous replacements, and its quirks aren’t too bad.
I marked the start and end of the transcript with ma and mb, from there on I operated on the range using :'a,'b followed by the substitution command.
:'a,'bs/\n\(\n<mchua\)/\1/
this finds a newline followed by (a newline + <mchua). The parentheses make it remember the newline + mchua, so I can replace with \1. That gets rid of much whitespace. But it stuck too close together, so I undid it.
I got your original version from history, and then
:'a,'bs/^Oct 15 ..:..:.. \t//
cleaned it all up, then I ran fmt --split-only
Other vim stuff:
xp deletes the character under the cursor, then puts it back after the next one, swapping the two. So it fixes ‘teh’
J joins two lines, then x if necessary to get rid of the space at the join.
is hard to read. Maybe I should have inserted at the end of each, so it appears as regular text.
Thanks for transcribing!
By skierpage on Oct 18, 2008
I just used fmt to format more IRC logs, and lo, it was gorgeous. Thanks, skierpage!
By Mel on Nov 12, 2008