Copyright © 2000 Paul Sheer


Streams and sed -- the stream editor


Introduction

The commands grep, echo, df and so on print some output to the screen. In fact, what is happening on a lower level is that they are writing characters one by one into a theoretical data stream (also called a pipe) called the stdout pipe. The terminal from which the shell was started then reads those characters and displays them on the screen. The word pipe itself means exactly that: a program places data in one end of a funnel while another program reads that data from the other end. The reason for pipes is to allow two separate programs to perform simple communications with each other. In this case, the program is merely communicating with the terminal in order to display some output.
The same is true with the cat command explained previously. This command, run with no arguments, reads from the stdin pipe. By default this is the keyboard. One further pipe is the stderr pipe, to which a program writes error messages. It is not possible to see whether a program message is caused by the program writing to its stderr or stdout pipe, because usually both are directed to the screen. Good programs, however, always write to the appropriate pipes to allow output to be specially separated for diagnostic purposes if need be.

Tutorial

Create a text file with lots of lines that contain the word GNU and one line that contains the word GNU as well as the word Linux. Then do grep GNU myfile.txt. The result is printed to stdout as usual. Now try grep GNU myfile.txt > gnu_lines.txt. What is happening here is that the output of the grep command is being redirected into a file. The > gnu_lines.txt tells the shell to create a new file gnu_lines.txt and fill it with any output from stdout, instead of displaying the output as it usually does. If the file already exists, it will be truncated (emptied) first.
Now suppose you want to append further output to this file. Using >> instead of > will not truncate the file but append any output to it. Try this: echo "morestuff" >> gnu_lines.txt. Then view the contents of gnu_lines.txt.
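
The difference between > and >> can be sketched in a few lines. This is only an illustration: it works in a scratch directory made with mktemp, and the file name demo.txt is made up for the example.

```shell
# Work in a throwaway directory so that no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

echo "first"  > demo.txt     # creates demo.txt
echo "second" > demo.txt     # truncates demo.txt, then writes to it
echo "third" >> demo.txt     # appends, so "second" is kept

result=$(cat demo.txt)
echo "$result"
```

Only the lines second and third should remain, because the second > discarded the earlier contents.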

Piping using | notation

The real power of pipes is when one program can read from the output of another program. Consider the grep command, which reads from stdin when given no file names. Run grep with only the search pattern as an argument on the command line:

# grep GNU
A line without that word in it
Another line without that word in it
A line with the word GNU in it
A line with the word GNU in it
I have the idea now
^C
#

grep's default is to read from stdin when no files are given. As you can see, it is doing its usual work of printing out lines that have the word GNU in them. Hence lines containing GNU will be printed twice - as you type them in and again when grep reads them and decides that they contain GNU.
Now try grep GNU myfile.txt | grep Linux. The first grep outputs all lines with the word GNU in them to stdout. The | tells the shell that all stdout is to be typed as stdin (as we just did above) into the next command, which is also a grep command. The second grep command scans that data for lines with the word Linux in them. grep is often used this way as a filter and can be used multiple times, e.g. grep L myfile.txt | grep i | grep n | grep u | grep x.
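
A non-interactive sketch of the same two-stage filter follows. The contents of myfile.txt here are invented for the illustration, and the file is created in a scratch directory.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# Build a small sample file (made-up contents).
printf '%s\n' \
    'GNU is not UNIX' \
    'GNU and Linux go together' \
    'Linux is a kernel' > myfile.txt

# The first grep keeps lines containing GNU; the second grep reads
# that output on its stdin and keeps only lines that also contain Linux.
result=$(grep GNU myfile.txt | grep Linux)
echo "$result"
```

Only the line containing both words survives both filters.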

A complex piping example

In a previous chapter we used grep on a dictionary to demonstrate regular expressions. This is how a dictionary of words can be created:

 
 
cat /usr/lib/ispell/english.hash | strings | tr 'A-Z' 'a-z' \
| grep '^[a-z]' | sort -u > mydict

The file english.hash contains the UNIX dictionary normally used for spell checking. With a bit of filtering you can create a dictionary that will make solving crossword puzzles a breeze. First we use the command strings, explained previously, to extract readable bits of text. Here we are using its alternate mode of operation where it reads from stdin when no files are specified on its command line. The command tr (abbreviated from translate; see the tr man page) then converts upper to lower case. The grep command then filters out lines that do not start with a letter. Finally the sort command sorts the words in alphabetical order. The -u option stands for unique, and specifies that there should be no duplicate lines of text. Now try less mydict.
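
Since english.hash may not be installed on every system, the filtering stages can be tried on a few sample words instead. The word list below is invented and stands in for the output of strings; LC_ALL=C is an assumption added here only to make the sort order predictable.

```shell
# Sample input standing in for the output of strings (made-up data).
words=$(printf '%s\n' 'Zebra' 'apple' '42nd' 'Apple' 'mango' 'zebra' |
    tr 'A-Z' 'a-z' |        # fold upper case to lower case
    grep '^[a-z]' |         # keep only lines that start with a letter
    LC_ALL=C sort -u)       # sort and drop duplicate lines
echo "$words"
```

The entry 42nd is dropped by grep, and the two spellings of apple and zebra collapse into one each.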

Redirecting streams with >&

Try the command ls nofile.txt > A. ls should give an error message if the file doesn't exist. The error message is, however, displayed on the screen and not written into the file A. This is because ls has written its error message to stderr while > has only redirected stdout. The way to get stdout and stderr to go to the same file is to use a redirection operator. As far as the shell is concerned, stdout is called 1 and stderr is called 2, and commands can be appended with a redirection like 2>&1 to dictate that stderr is to be mixed into the output of stdout. The actual words stderr and stdout are only used in C programming. Try the following:

 
 
 
touch existing_file
rm -f non-existing_file
ls existing_file non-existing_file

ls will output two lines: a line containing a listing for the file existing_file and a line containing an error message to explain that the file non-existing_file does not exist. The error message would have been written to stderr or file descriptor number 2, and the remaining line would have been written to stdout or file descriptor number 1. Next we try

 
 
ls existing_file non-existing_file 2>A
cat A

Now A contains the error message, while the remaining output came to the screen. Now try,

 
 
ls existing_file non-existing_file 1>A
cat A

The notation 1>A is the same as >A because the shell assumes that you are referring to file descriptor 1 when you don't specify any. Now A contains the stdout output, while the error message has been redirected to the screen. Now try,

 
 
ls existing_file non-existing_file 1>A 2>&1
cat A

Now A contains both the error message and the normal output. The >& is called a redirection operator. x>&y tells the shell to send pipe x to wherever pipe y currently goes. Redirections are processed from left to right on the command line. Hence the above command means to redirect stdout to the file A, and then to mix stderr into stdout, which by then goes to A. Finally,

 
 
ls existing_file non-existing_file 2>A 1>&2
cat A

We notice that this has the same effect, except that here we are doing the reverse: redirecting stderr into a file A, and then redirecting stdout into stderr. To see what happens if we redirect in reverse order, we can try,

 
 
ls existing_file non-existing_file 2>&1 1>A
cat A

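which means to redirect stderr to wherever stdout currently goes (the screen), and then to redirect stdout into a file A. This will therefore not mix stderr and stdout: the error message still appears on the screen, because stderr was pointed at the screen before stdout was redirected to A.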
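
The effect of ordering can be checked without watching the screen. In this sketch the function both is a hypothetical stand-in for ls that writes one line to each pipe, and the file screen.txt plays the part of the screen by capturing whatever the braces would otherwise have displayed.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# Hypothetical stand-in for ls: one line to stdout, one to stderr.
both() {
    echo "to stdout"
    echo "to stderr" >&2
}

# stdout goes to A first, then stderr follows stdout into A.
both 1>A 2>&1

# stderr is pointed at the current stdout (screen.txt) first,
# and only then is stdout moved to B.
{ both 2>&1 1>B; } > screen.txt

cat A
cat B
cat screen.txt
```

A should hold both lines, while B holds only the stdout line and screen.txt only the stderr line.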

Using sed to edit streams

ed used to be the standard text editor for UNIX. It is cryptic to use, but is compact and programmable. sed stands for stream editor, and is the only incarnation of ed that is commonly used today. sed allows editing of files non-interactively. In the way that grep can search for words and filter lines of text, sed can do search-replace operations and insert and delete lines in text files. sed is one of those programs with no man page to speak of. Do info sed to see sed's comprehensive info pages with examples. The most common usage of sed is to replace words in a stream with alternative words. sed reads from stdin and writes to stdout. Like grep, it is line buffered, which means that it reads one line in at a time and then writes that line out again after performing whatever editing operations were requested. Replacements are typically done with:

 
 
cat <file> | sed -e 's/<search-regexp>/<replace-text>/<option>' \
> <resultfile>

where search-regexp is a regular expression, replace-text is the text you would like to replace each found occurrence with, and option is nothing or g, which means to replace every occurrence in the same line (usually sed just replaces the first occurrence of the regular expression in each line). (There are other options; see the sed info page.) For demonstration, type

 
sed -e 's/e/E/g'

and type out a few lines of English text.
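
The effect of the g option can also be seen without typing interactively. This is a small sketch; the sample sentence is made up.

```shell
line='every engineer edits'

# Without g, only the first e on the line is replaced.
first_only=$(echo "$line" | sed -e 's/e/E/')

# With g, every e on the line is replaced.
all=$(echo "$line" | sed -e 's/e/E/g')

echo "$first_only"   # Every engineer edits
echo "$all"          # EvEry EnginEEr Edits
```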

sed is actually an extremely powerful and important system of editing. A complete overview will be done later. Here we will concentrate on searching and replacing regular expressions.

Regular expression sub-expressions

This section explains how to do the apparently complex task of moving text around within lines. Consider, for example, the output of ls. Say you want to automatically strip out only the size column -- sed can do this sort of editing using the special \( \) notation to group parts of the regular expression together. Consider the following example:

 
sed -e 's/\(\<[^ ]*\>\)\([ ]*\)\(\<[^ ]*\>\)/\3\2\1/g'

Here sed is searching for the expression \<[^ ]*\>[ ]*\<[^ ]*\>. From the chapter on regular expressions, we can see that it matches a whole word, an arbitrary amount of whitespace, and then another whole word. The \( \) groups these three so that they can be referred to in replace-text. Each part of the regular expression inside \( \) is called a sub-expression of the regular expression. Each sub-expression is numbered -- namely \1, \2 etc. Hence \1 in replace-text is the first \<[^ ]*\>, \2 is [ ]*, and finally, \3 is the second \<[^ ]*\>. Now test to see what happens when you run this:

 
 
 
sed -e 's/\(\<[^ ]*\>\)\([ ]*\)\(\<[^ ]*\>\)/\3\2\1/g'
GNU Linux is cool
Linux GNU cool is
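
As a further illustration of how sub-expressions are numbered, the following sketch reorders the three parts of a date. The date format and variable names are chosen only for this example; they are not taken from the ls discussion.

```shell
date_iso='2000-10-07'

# \1 is the year, \2 the month and \3 the day; the replacement
# emits them in reverse order, separated by escaped slashes.
date_dmy=$(echo "$date_iso" |
    sed -e 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/')

echo "$date_dmy"   # 07/10/2000
```

The numbering follows the order in which the opening \( appear in the regular expression, reading from the left.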

To return to our ls example (note that this is just an example; to count file sizes you should rather use the du command), suppose we would like to sum the byte sizes of all the files in a directory:

 
 
expr 0 `ls -l | grep '^-' | \
  sed 's/^\([^ ]*[ ]*\)\\{4,4\\}\([0-9]*\).*$/ + \2/'`

We know that ls -l output lines start with - for ordinary files. So we use grep to strip lines not starting with -. If we do an ls -l, we see the output is divided into four columns of stuff we are not interested in, and then a number indicating the size of the file. A column (or field) can be described by the regular expression [^ ]*[ ]*, i.e. a length of text with no whitespace, followed by a length of whitespace. There are four of these, so we bracket it with \( \), and then use the \{ \} notation to indicate that we want exactly 4. (The backslashes before the braces are doubled in the command because, inside backquotes, the shell turns \\ into a single \ before sed sees it.) After that comes our number [0-9]*, and then any trailing characters which we are not interested in, .*$. Notice here that we have neglected to use \< \> notation to indicate whole words. This is because sed tries to match the maximum number of characters legally allowed, and in the situation we have here, this has exactly the same effect.

If you haven't yet figured it out, we are trying to get that column of byte sizes into a format like,

 
 
 
 
+ 438
+ 1525
+ 76
+ 92146

... so that expr can understand it. Hence we replace each line with sub-expression \2 and a leading + sign. Backquotes give the output of this to expr, which sums them studiously, ignoring any newline characters as though the summation were typed in on a single line. There is one minor problem here: the first line contains a + with nothing before it, which will cause expr to complain. To get around this, we can just add a 0 to the expression, so that it becomes 0 + ....
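
The whole summation can be tried on fixed input so that the result is predictable. In this sketch the listing is invented sample ls -l output rather than a real directory, and $( ) is used in place of backquotes, which means the backslashes before the braces do not need to be doubled.

```shell
# Sample ls -l lines (made-up data); a real run would use: ls -l
listing='-rw-r--r--  1 root root   438 Oct  7  2000 a.txt
-rw-r--r--  1 root root  1525 Oct  7  2000 b.txt
drwxr-xr-x  2 root root  4096 Oct  7  2000 subdir
-rw-r--r--  1 root root    76 Oct  7  2000 c.txt'

# Keep ordinary files only, then turn each line into " + <size>".
terms=$(echo "$listing" | grep '^-' |
    sed 's/^\([^ ]*[ ]*\)\{4,4\}\([0-9]*\).*$/ + \2/')

# $terms is deliberately unquoted so that expr receives
# each + and each number as a separate argument.
total=$(expr 0 $terms)
echo "$total"   # 438 + 1525 + 76 = 2039
```

The directory line is removed by grep, so its size does not contribute to the sum.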


Paul Sheer 2000-10-07