Linux Shell Scripting – various recipes to help you

In this article by Shantanu Tushar and Sarath Lakshman, authors of Linux Shell Scripting Cookbook, Second Edition, we will cover the following recipes:

  • Using regular expressions
  • Searching and mining text inside a file with grep
  • Cutting a file column-wise with cut
  • Using sed to perform text replacement
  • Using awk for advanced text processing
  • Finding frequency of words used in a given file
  • Using head and tail for printing the last or first 10 lines
  • Counting the number of lines, words, and characters in a file

The shell scripting language is packed with all the essential problem-solving components for Unix/Linux systems. Text processing is one of the key areas where shell scripting is used, and there are beautiful utilities such as sed, awk, grep, and cut, which can be combined to solve problems related to text processing.

Various utilities help us process a file at the level of characters, lines, words, columns, or rows, allowing us to manipulate text files in many ways. Regular expressions are the core of pattern-matching techniques, and most text-processing utilities come with support for them. By using suitable regular expression strings, we can produce the desired output, such as filtering, stripping, replacing, and searching.

Using regular expressions

Regular expressions are the heart of text-processing techniques based on pattern matching. For fluency in writing text-processing tools, one must have a basic understanding of regular expressions. Wildcard techniques are very limited in the scope of text they can match; regular expressions are a tiny, highly specialized programming language used to match text. A typical regular expression for matching an e-mail address might look like [a-z0-9_]+@[a-z0-9]+\.[a-z]+.

If this looks weird, don't worry, it is really simple once you understand the concepts through this recipe.

How to do it...

Regular expressions are composed of text fragments and symbols that have special meanings. Using these, we can construct a suitable regular expression string to match any text according to the context. As regex is a generic language for matching text, we are not introducing any specific tools in this recipe.

Let's see a few examples of text matching:

  • To match all words in a given text, we can write the regex as follows:

    ( ?[a-zA-Z]+ ?)

    ? is the notation for zero or one occurrence of the previous expression, which in this case is the space character. The [a-zA-Z]+ notation represents one or more alphabetic characters (a-z and A-Z).

  • To match an IP address, we can write the regex as follows:

    [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

    Or:

    [[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}

    An IP address is in the form of four integers (each from 0 to 255) separated by dots, for example, 192.168.0.2.

    [0-9] or [[:digit:]] represents a match for digits from 0 to 9, {1,3} matches one to three digits, and \. matches the dot character (.).

    This regex will match an IP address in the text being processed. However, it doesn't check for the validity of the address. For example, an IP address of the form 123.300.1.1 will be matched by the regex despite being an invalid IP. This is because when parsing text streams, usually the aim is to only detect IPs.
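
    As a quick, hedged check of this pattern (using grep -E, which is introduced in the next recipe; the sample addresses are made up), note how the invalid address is still matched:

    $ echo -e "ok: 192.168.0.2\nno address here\nbad: 123.300.1.1" | grep -E -o "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
    192.168.0.2
    123.300.1.1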

How it works...

Let's first go through the basic components of regular expressions (regex):

  • ^ : This specifies the start of the line marker. Example: ^tux matches a line that starts with tux.
  • $ : This specifies the end of the line marker. Example: tux$ matches a line that ends with tux.
  • . : This matches any one character. Example: Hack. matches Hack1 and Hacki, but not Hack12 or Hackil; only one additional character is matched.
  • [] : This matches any one of the characters enclosed in [chars]. Example: coo[kl] matches cook or cool.
  • [^] : This matches any one character except those enclosed in [^chars]. Example: 9[^01] matches 92 and 93, but not 91 or 90.
  • [-] : This matches any character within the range specified in []. Example: [1-5] matches any digit from 1 to 5.
  • ? : The preceding item must match one or zero times. Example: colou?r matches color or colour, but not colouur.
  • + : The preceding item must match one or more times. Example: Rollno-9+ matches Rollno-99 and Rollno-9, but not Rollno-.
  • * : The preceding item must match zero or more times. Example: co*l matches cl, col, and coool.
  • () : This treats the enclosed terms as one entity. Example: ma(tri)?x matches max or matrix.
  • {n} : The preceding item must match exactly n times. Example: [0-9]{3} matches any three-digit number; [0-9]{3} can be expanded as [0-9][0-9][0-9].
  • {n,} : This specifies the minimum number of times the preceding item should match. Example: [0-9]{2,} matches any number that is two or more digits long.
  • {n,m} : This specifies the minimum and maximum number of times the preceding item should match. Example: [0-9]{2,5} matches any number that has two to five digits.
  • | : This specifies alternation; one of the items on either side of | should match. Example: Oct (1st|2nd) matches Oct 1st or Oct 2nd.
  • \ : This is the escape character for escaping any of the special characters mentioned previously. Example: a\.b matches a.b, but not ajb; the \ removes the special meaning of the dot.

For more details on the regular expression components available, you can refer to the following URL:

http://www.linuxforu.com/2011/04/sed-explained-part-1/

There's more...

Let's see how the special meanings of certain characters are specified in the regular expressions.

Treatment of special characters

Regular expressions use some characters, such as $, ^, ., *, +, {, and }, as special characters. But, what if we want to use these characters as normal text characters? Let's see an example of a regex, a.txt.

This will match the character a, followed by any character (due to the '.' character), followed by the string txt. However, we want '.' to match a literal '.' instead of any character. In order to achieve this, we precede the character with a backslash \ (doing this is called escaping the character). This indicates that the regex should match the literal character rather than its special meaning. Hence, the final regex becomes a\.txt.
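
A minimal sketch of the difference, assuming GNU grep (the sample strings are made up):

$ echo -e "a.txt\naztxt" | grep "a.txt"
a.txt
aztxt
$ echo -e "a.txt\naztxt" | grep "a\.txt"
a.txt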

Visualizing regular expressions

Regular expressions can be tough to understand at times, but for people who are good at understanding things with diagrams, there are utilities available to help in visualizing regex. One such tool is available at http://www.regexper.com; it lets you enter a regular expression and creates a nice graph to help understand it. Try it with the e-mail regular expression we saw in the previous section.

Searching and mining a text inside a file with grep

Searching inside a file is an important use case in text processing. We may need to search through thousands of lines in a file to find out some required data, by using certain specifications. This recipe will help you learn how to locate data items of a given specification from a pool of data.

How to do it...

The grep command is the magic Unix utility for searching in text. It accepts regular expressions, and can produce output in various formats. Additionally, it has numerous interesting options. Let's see how to use them:

  1. To search for lines of text that contain the given pattern:

    $ grep pattern filename
    this is the line containing pattern

    Or:

    $ grep "pattern" filename
    this is the line containing pattern

  2. We can also read from stdin as follows:

    $ echo -e "this is a word\nnext line" | grep word
    this is a word

  3. Perform a search in multiple files by using a single grep invocation, as follows:

    $ grep "match_text" file1 file2 file3 ...

  4. We can highlight the word in the line by using the --color option as follows:

    $ grep word filename --color=auto
    this is the line containing word

  5. Usually, the grep command only interprets some of the special characters in match_text. To use the full set of regular expressions as input arguments, the -E option should be added, which means an extended regular expression. Or, we can use an extended regular expression enabled grep command, egrep. For example:

    $ grep -E "[a-z]+" filename

    Or:

    $ egrep "[a-z]+" filename

  6. In order to output only the matching portion of a text in a file, use the -o option as follows:

    $ echo this is a line. | egrep -o "[a-z]+\."
    line.

  7. In order to print all of the lines, except the line containing match_pattern, use:

    $ grep -v match_pattern file

    The -v option added to grep inverts the match results.

  8. Count the number of lines in which a matching string or regex match appears in a file or text, as follows:

    $ grep -c "text" filename
    10

    It should be noted that -c counts only the number of matching lines, not the number of times a match is made. For example:

    $ echo -e "1 2 3 4\nhello\n5 6" | egrep -c "[0-9]" 2

    Even though there are six matching items, it prints 2, since there are only two matching lines. Multiple matches in a single line are counted only once.

  9. To count the number of matching items in a file, use the following trick:

    $ echo -e "1 2 3 4\nhello\n5 6" | egrep -o "[0-9]" | wc -l 6

  10. Print the line number of the match string as follows:

    $ cat sample1.txt
    gnu is not unix
    linux is fun
    bash is art
    $ cat sample2.txt
    planetlinux
    $ grep linux -n sample1.txt
    2:linux is fun

    Or:

    $ cat sample1.txt | grep linux -n

    If multiple files are used, it will also print the filename with the result as follows:

    $ grep linux -n sample1.txt sample2.txt
    sample1.txt:2:linux is fun
    sample2.txt:2:planetlinux

  11. Print the character or byte offset at which a pattern matches, as follows:

    $ echo gnu is not unix | grep -b -o "not"
    7:not

    The character offset for a string in a line is counted from 0, starting with the first character. In the preceding example, not starts at offset 7 in the line gnu is not unix.

    The -b option is always used with -o.

  12. To search over multiple files, and list which files contain the pattern, we use the following:

    $ grep -l linux sample1.txt sample2.txt
    sample1.txt
    sample2.txt

    The inverse of the -l argument is -L. The -L argument returns a list of non-matching files.

There's more...

We have seen the basic usages of the grep command, but that's not it; the grep command comes with even more features. Let's go through those.

Recursively search many files

To recursively search for text in a directory and all its descendant directories, use the following command:

$ grep "text" . -R -n

In this command, "." specifies the current directory.

The options -R and -r mean the same thing when used with grep.

For example:

$ cd src_dir
$ grep "test_function()" . -R -n
./miscutils/test.c:16:test_function();

test_function() exists in line number 16 of miscutils/test.c.

This is one of the most frequently used commands by developers. It is used to find files in the source code where a certain text exists.

Ignoring case of pattern

The -i argument helps match patterns without considering case (uppercase or lowercase). For example:

$ echo hello world | grep -i "HELLO"
hello

grep by matching multiple patterns

Usually, we specify single patterns for matching. However, we can use an argument -e to specify multiple patterns for matching, as follows:

$ grep -e "pattern1" -e "pattern"

This will print the lines that contain either of the patterns and output one line for each match. For example:

$ echo this is a line of text | grep -e "this" -e "line" -o
this
line

There is also another way to specify multiple patterns. We can use a pattern file for reading patterns. Write patterns to match line-by-line, and execute grep with a -f argument as follows:

$ grep -f pattern_file source_filename

For example:

$ cat pat_file
hello
cool
$ echo hello this is cool | grep -f pat_file
hello this is cool

Including and excluding files in a grep search

grep can include or exclude files in which to search. We can specify include files or exclude files by using wild card patterns.

To search only for .c and .cpp files recursively in a directory by excluding all other file types, use the following command:

$ grep "main()" . -r --include *.{c,cpp}

Note that some{string1,string2,string3} expands to somestring1 somestring2 somestring3.

Exclude all README files in the search, as follows:

$ grep "main()" . -r --exclude "README"

To exclude directories, use the --exclude-dir option.

To read a list of files to exclude from a file, use --exclude-from FILE.

Using grep with xargs with zero-byte suffix

The xargs command is often used to provide a list of filenames as command-line arguments to another command. When filenames are used as command-line arguments, it is recommended to use a zero-byte terminator for the filenames instead of a space terminator. Some filenames can contain a space character, which would be misinterpreted as a terminator, so that a single filename is broken into two (for example, New file.txt can be interpreted as two filenames, New and file.txt). This problem can be avoided by using a zero-byte terminator. xargs accepts stdin text from commands such as grep and find, which can output text to stdout with a zero-byte terminator. In order to specify that the input terminator for filenames is the zero byte (\0), we should use -0 with xargs.

Create some test files as follows:

$ echo "test" > file1 $ echo "cool" > file2 $ echo "test" > file3

In the following command sequence, grep outputs filenames with a zero-byte terminator (\0), because of the -Z option with grep. xargs -0 reads the input and separates filenames with a zero-byte terminator:

$ grep "test" file* -lZ | xargs -0 rm

Usually, -Z is used along with -l.

Silent output for grep

Sometimes, instead of actually looking at the matched strings, we are only interested in whether there was a match or not. For this, we can use the quiet option (-q), where the grep command does not write any output to the standard output. Instead, it runs the command and returns an exit status based on success or failure.

We know that a command returns 0 on success, and non-zero on failure.

Let's go through a script that makes use of grep in a quiet mode, for testing whether a match text appears in a file or not.

#!/bin/bash
#Filename: silent_grep.sh
#Desc: Testing whether a file contains a text or not

if [ $# -ne 2 ]; then
  echo "Usage: $0 match_text filename"
  exit 1
fi

match_text=$1
filename=$2

grep -q "$match_text" "$filename"

if [ $? -eq 0 ]; then
  echo "The text exists in the file"
else
  echo "Text does not exist in the file"
fi

The silent_grep.sh script can be run as follows, by providing a match word (Student) and a file name (student_data.txt) as the command argument:

$ ./silent_grep.sh Student student_data.txt
The text exists in the file
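
If you only need the check inline rather than as a script, the same idea can be written as a one-liner; this sketch assumes the same sample file:

$ grep -q "Student" student_data.txt && echo "The text exists in the file" || echo "Text does not exist in the file"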

Printing lines before and after text matches

Context-based printing is one of the nice features of grep. When grep finds a line that matches the given text, it usually prints only that line. But we may need "n" lines after the matching line, or "n" lines before it, or both. This can be performed by using context-line control in grep. Let's see how to do it.

In order to print three lines after a match, use the -A option:

$ seq 10 | grep 5 -A 3
5
6
7
8

In order to print three lines before the match, use the -B option:

$ seq 10 | grep 5 -B 3
2
3
4
5

In order to print three lines both before and after the match, use the -C option as follows:

$ seq 10 | grep 5 -C 3
2
3
4
5
6
7
8

If there are multiple matches, then each section is delimited by a line "--":

$ echo -e "a\nb\nc\na\nb\nc" | grep a -A 1 a b -- a b

Cutting a file column-wise with cut

We may need to cut text by columns rather than rows. Let's assume that we have a text file containing student reports with columns such as Roll, Name, Mark, and Percentage. We may need to extract only the names of the students to another file, or extract the nth column, or extract two or more columns. This recipe will illustrate how to perform this task.

How to do it...

cut is a small utility that often comes to our help for cutting in a column fashion. It can also specify the delimiter that separates the columns. In cut terminology, each column is known as a field.

  1. To extract particular fields or columns, use the following syntax:

    cut -f FIELD_LIST filename

    FIELD_LIST is a list of columns that are to be displayed. The list consists of column numbers delimited by commas. For example:

    $ cut -f 2,3 filename

    Here, the second and the third columns are displayed.

  2. cut can also read input text from stdin.

    Tab is the default delimiter for fields or columns. If lines without delimiters are found, they are also printed. To avoid printing lines that do not have delimiter characters, attach the -s option along with cut. An example of using the cut command for columns is as follows:

    $ cat student_data.txt
    No Name Mark Percent
    1 Sarath 45 90
    2 Alex 49 98
    3 Anu 45 90
    $ cut -f1 student_data.txt
    No
    1
    2
    3

  3. Extract multiple fields as follows:

    $ cut -f2,4 student_data.txt
    Name Percent
    Sarath 90
    Alex 98
    Anu 90

  4. To print multiple columns, provide a list of column numbers separated by commas as arguments to -f.
  5. We can also complement the extracted fields by using the --complement option. Suppose you have many fields and you want to print all the columns except the third column, then use the following command:

    $ cut -f3 --complement student_data.txt
    No Name Percent
    1 Sarath 90
    2 Alex 98
    3 Anu 90

  6. To specify the delimiter character for the fields, use the -d option as follows:

    $ cat delimited_data.txt
    No;Name;Mark;Percent
    1;Sarath;45;90
    2;Alex;49;98
    3;Anu;45;90
    $ cut -f2 -d";" delimited_data.txt
    Name
    Sarath
    Alex
    Anu
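
    As a practical sketch, the same -d option works on any delimited system file; for example, /etc/passwd is colon-separated, so usernames can be extracted from the first field (the first account is typically root, though output varies per system):

    $ cut -d: -f1 /etc/passwd | head -n 1
    root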

There's more...

The cut command has more options to specify the character sequences to be displayed as columns. Let's go through the additional options available with cut.

Specifying the range of characters or bytes as fields

Suppose that we don't rely on delimiters, but we need to extract fields such that a range of characters (counting from 1 at the start of the line) defines a field. Such extractions are possible with cut.

Let's see what notations are possible:

  • N- : from the Nth byte, character, or field, to the end of the line
  • N-M : from the Nth to the Mth (inclusive) byte, character, or field
  • -M : from the first to the Mth (inclusive) byte, character, or field

We use the preceding notations to specify fields as a range of bytes or characters with the following options:

  • -b for bytes
  • -c for characters
  • -f for defining fields

For example:

$ cat range_fields.txt
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxy

You can print the first to fifth characters as follows:

$ cut -c1-5 range_fields.txt
abcde
abcde
abcde
abcde

The first two characters can be printed as follows:

$ cut range_fields.txt -c -2
ab
ab
ab
ab

Replace -c with -b to count in bytes.

We can specify the output delimiter while using -c, -f, or -b, as follows:

--output-delimiter "delimiter string"

When multiple ranges are extracted with -b or -c, specifying --output-delimiter is a must; otherwise, you cannot distinguish between the fields in the output. For example:

$ cut range_fields.txt -c1-3,6-9 --output-delimiter ","
abc,fghi
abc,fghi
abc,fghi
abc,fghi

Using sed to perform text replacement

sed stands for stream editor. It is an essential tool for text processing, and a marvelous utility to play around with regular expressions. A well-known use of the sed command is text replacement. This recipe will cover most of the frequently used sed techniques.

How to do it…

sed can be used to replace occurrences of a string with another string in a given text.

  1. The text to replace can be matched using regular expressions:

    $ sed 's/pattern/replace_string/' file

    Or:

    $ cat file | sed 's/pattern/replace_string/'

    This command reads from stdin.

    If you use the vi editor, you will notice that the command to replace the text is very similar to the one discussed here.

  2. By default, sed prints the substituted text to stdout and does not modify the file. To save the changes along with the substitutions to the same file, use the -i option. Most users follow multiple redirections to save the file after making a replacement, as follows:

    $ sed 's/text/replace/' file > newfile
    $ mv newfile file

    However, it can be done in just one line; for example:

    $ sed -i 's/text/replace/' file

  3. These usages of the sed command will replace the first occurrence of the pattern in each line. If we want to replace every occurrence, we need to add the g parameter at the end, as follows:

    $ sed 's/pattern/replace_string/g' file

    The /g suffix means that it will substitute every occurrence. However, we sometimes need to replace only the Nth occurrence onwards. For this, we can use the /Ng form of the option.

    Have a look at the following commands:

    $ echo thisthisthisthis | sed 's/this/THIS/2g'
    thisTHISTHISTHIS
    $ echo thisthisthisthis | sed 's/this/THIS/3g'
    thisthisTHISTHIS
    $ echo thisthisthisthis | sed 's/this/THIS/4g'
    thisthisthisTHIS

    We have used / in sed as a delimiter character. We can use any delimiter characters as follows:

    sed 's:text:replace:g'
    sed 's|text|replace|g'

    When the delimiter character appears inside the pattern, we have to escape it using the \ prefix, as follows:

    sed 's|te\|xt|replace|g'

    Here, \| is the delimiter character appearing in the pattern, escaped so that it is treated as a literal character instead of as the delimiter.
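
    Alternate delimiters are especially handy when the pattern itself contains slashes, for example when replacing file paths; a small sketch (the paths are hypothetical):

    $ echo /usr/local/bin/app | sed 's|/usr/local/bin|/opt/bin|'
    /opt/bin/app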

There's more...

The sed command comes with numerous options for text manipulation. By combining the options available with sed in logical sequences, many complex problems can be solved in one line. Let's see the different options available with sed.

Removing blank lines

sed makes it simple to remove blank lines from a file. Blank lines can be matched with the regular expression ^$:

$ sed '/^$/d' file

/pattern/d will remove lines matching the pattern.

For blank lines, the line end marker appears next to the line start marker.
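
A quick sketch to verify this behavior:

$ printf "line1\n\nline2\n\n\nline3\n" | sed '/^$/d'
line1
line2
line3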

Performing replacement directly in the file

When a filename is passed to sed, it usually prints its output to stdout. Instead, if we want it to actually modify the contents of the file, we use the -i option, as follows:

$ sed 's/PATTERN/replacement/' -i filename

For example, replace all three-digit numbers with another specified number in a file, as follows:

$ cat sed_data.txt
11 abc 111 this 9 file contains 111 11 88 numbers 0000
$ sed -i 's/\b[0-9]\{3\}\b/NUMBER/g' sed_data.txt
$ cat sed_data.txt
11 abc NUMBER this 9 file contains NUMBER 11 88 numbers 0000

The preceding one-liner replaces three-digit numbers only. \b[0-9]\{3\}\b is the regular expression used to match three-digit numbers. [0-9] is the range of digits; that is, from 0 to 9. {3} is used for matching the preceding character thrice. \ in \{3\} is used to give a special meaning for { and }. \b is the word boundary marker.

It's a useful practice to first try the sed command without -i to make sure your regex is correct, and once you are satisfied with the result, add the -i option to actually make changes to the file. Alternatively, you can use the following form of sed:

sed -i.bak 's/abc/def/' file

In this case, sed will not only perform the replacement on the file, but it will also create a file called file.bak, which will contain the original contents.

Matched string notation (&)

In sed, we can use & as the matched string for the substitution pattern, in such a way that we can use the matched string in the replacement string.

For example:

$ echo this is an example | sed 's/\w\+/[&]/g'
[this] [is] [an] [example]

Here, regex \w\+ matches every word. Then, we replace it with [&]. & corresponds to the word that is matched.

Substring match notation (\1)

& corresponds to the matched string for the given pattern. We can also match the substrings of the given pattern. Let's see how to do it.

$ echo this is digit 7 in a number | sed 's/digit \([0-9]\)/\1/'
this is 7 in a number

The preceding command replaces digit 7 with 7. The substring matched is 7. \(pattern\) is used to match the substring; the pattern is enclosed in ( ), each escaped with a backslash. For the first substring match, the corresponding notation is \1; for the second, it is \2, and so on. Go through the following example with multiple matches:

$ echo seven EIGHT | sed 's/\([a-z]\+\) \([A-Z]\+\)/\2 \1/'
EIGHT seven

\([a-z]\+\) matches the first word, and \([A-Z]\+\) matches the second word; \1 and \2 are used for referencing them. This type of referencing is called back referencing. In the replacement part, their order is changed to \2 \1, hence the words appear in reverse order.

Combination of multiple expressions

A combination of multiple sed commands joined by a pipe, as shown next, can be replaced by a single sed command:

sed 'expression' | sed 'expression'

The preceding command is equivalent to the following:

$ sed 'expression; expression'

Or:

$ sed -e 'expression' -e 'expression'

For example,

$ echo abc | sed 's/a/A/' | sed 's/c/C/'
AbC
$ echo abc | sed 's/a/A/;s/c/C/'
AbC
$ echo abc | sed -e 's/a/A/' -e 's/c/C/'
AbC

Quoting

Usually, the sed expression is quoted using single quotes. But double quotes can also be used; the shell then expands the expression by evaluating it before sed sees it. Using double quotes is useful when we want to use a variable in a sed expression.

For example:

$ text=hello
$ echo hello world | sed "s/$text/HELLO/"
HELLO world

$text is evaluated as hello.

Using awk for advanced text processing

awk is a tool designed to work with data streams. It is very interesting, as it can operate on columns and rows. It supports many built-in functionalities, such as arrays and functions, similar to those in the C programming language. Its biggest advantage is its flexibility.

Getting ready...

The structure of an awk script is as follows:

awk 'BEGIN{ print "start" } pattern { commands } END{ print "end" }' file

The awk command can read from stdin also.

An awk script usually consists of three parts: BEGIN, END, and a common statements block with the pattern match option. All three are optional, and any of them can be absent from the script.

How to do it…

Let's write a simple awk script enclosed in single quotes or double quotes, as follows:

awk 'BEGIN { statements } { statements } END { end statements }'

Or, alternately, use the following command:

awk "BEGIN { statements } { statements } END { end statements }"

For example:

$ awk 'BEGIN { i=0 } { i++ } END{ print i}' filename

Or:

$ awk "BEGIN { i=0 } { i++ } END{ print i }" filename

How it works…

The awk command works in the following manner:

  1. Execute the statements in the BEGIN { commands } block.
  2. Read one line from the file or stdin, and execute pattern { commands }. Repeat this step until the end of the file is reached.
  3. When the end of the input stream is reached, execute the END { commands } block.

The BEGIN block is executed before awk starts reading lines from the input stream. It is an optional block. The statements, such as variable initialization and printing the output header for an output table, are common statements that are written in the BEGIN block.

The END block is similar to the BEGIN block. It gets executed when awk completes reading all the lines from the input stream. The statements, such as printing results after analyzing all the values calculated for all the lines or printing the conclusion are the commonly-used statements in the END block (for example, after comparing all the lines, print the maximum number from a file). This is an optional block.

The most important block is of the common commands with the pattern block. This block is also optional. If this block is not provided, by default { print } gets executed so as to print each of the lines read. This block gets executed for each line read by awk. It is like a while loop for lines read, with statements provided inside the body of the loop.

When a line is read, it checks whether the provided pattern matches the line. The pattern can be a regular expression match, conditions, range of lines match, and so on. If the current read line matches with the pattern, it executes the statements enclosed in { }.

The pattern is optional. If it is not used, all the lines are matched and the statements inside { } are executed.

Let's go through the following example:

$ echo -e "line1\nline2" |
awk 'BEGIN{ print "Start" } { print } END{ print "End" } '
Start line1 line2 End

When print is used without an argument, it will print the current line. There are two important things to be kept in mind about it. When the arguments of the print are separated by commas, they are printed with a space delimiter. Double quotes are used as the concatenation operator in the context of print in awk.

For example:

$ echo | awk '{ var1="v1"; var2="v2"; var3="v3"; \
print var1,var2,var3 ; }'

The preceding statement will print the values of the variables as follows:

v1 v2 v3

The echo command writes a single line into the standard output. Hence, the statements in the { } block of awk are executed once. If the standard input to awk contains multiple lines, the commands in awk will be executed multiple times.

Concatenation can be used as follows:

$ echo | awk '{ var1="v1"; var2="v2"; var3="v3"; \
print var1 "-" var2 "-" var3 ; }'

The output will be as follows:

v1-v2-v3

{ } is like a block in a loop, iterating through each line of a file.

Usually, we place initial variable assignments, such as var=0; and like statements, print the file header in the BEGIN block. In the END{} block, we place statements such as printing results.

There's more…

The awk command comes with a lot of rich features. In order to master the art of awk programming, you should be familiar with the important awk options and functionalities. Let's go through the essential functionalities of awk.

Special variables

Some special variables that can be used with awk are as follows:

  • NR: It stands for the current record number, which corresponds to the current line number when it uses lines as records
  • NF: It stands for the number of fields, and corresponds to the number of fields in the current record under execution (fields are delimited by space)
  • $0: It is a variable that contains the text content of the current line under execution
  • $1: It is a variable that holds the text of the first field
  • $2: It is the variable that holds the text of the second field

For example:

$ echo -e "line1 f2 f3\nline2 f4 f5\nline3 f6 f7" | \ awk '{ print "Line no:"NR",No of fields:"NF, "$0="$0, "$1="$1,"$2="$2,"$3="$3 }' Line no:1,No of fields:3 $0=line1 f2 f3 $1=line1 $2=f2 $3=f3 Line no:2,No of fields:3 $0=line2 f4 f5 $1=line2 $2=f4 $3=f5 Line no:3,No of fields:3 $0=line3 f6 f7 $1=line3 $2=f6 $3=f7

We can print the last field of a line using print $NF, the second-to-last using $(NF-1), and so on.

awk also provides the printf() function with the same syntax as in C. We can also use that instead of print.
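
A small sketch of both ideas (the sample line is made up):

$ echo "a b c" | awk '{ print $NF, $(NF-1) }'
c b
$ echo "a b c" | awk '{ printf "%s-%s-%s\n", $1, $2, $3 }'
a-b-c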

Let's see some basic awk usage examples. Print the third and second fields of every line, in that order, as follows:

$ awk '{ print $3,$2 }' file

In order to count the number of lines in a file, use the following command:

$ awk 'END{ print NR }' file

Here, we only use the END block. awk updates NR with the current line number as it reads each line; when it reaches the end of the input, NR contains the number of the last line. Hence, in the END block, NR gives the line count.

You can sum up all the numbers from each line of field 1 as follows:

$ seq 5 | awk 'BEGIN{ sum=0; print "Summation:" } { print $1"+"; sum+=$1 } END { print "=="; print sum }'
Summation:
1+
2+
3+
4+
5+
==
15

Passing an external variable to awk

By using the -v argument, we can pass external values other than stdin to awk, as follows:

$ VAR=10000
$ echo | awk -v VARIABLE=$VAR '{ print VARIABLE }'
10000

There is a flexible alternate method to pass many variable values from outside awk. For example:

$ var1="Variable1" ; var2="Variable2" $ echo | awk '{ print v1,v2 }' v1=$var1 v2=$var2 Variable1 Variable2

When an input is given through a file rather than standard input, use the following command:

$ awk '{ print v1,v2 }' v1=$var1 v2=$var2 filename

In the preceding method, variables are specified as key-value pairs separated by a space (v1=$var1 v2=$var2), as command arguments to awk, immediately after the quoted script containing the BEGIN, { }, and END blocks.

Reading a line explicitly using getline

Usually, awk reads all the lines of the input by default. If you want to read a line explicitly under your own control, you can use the getline function. Sometimes, you may need to read the first line from the BEGIN block.

The syntax is getline var. The variable var will contain the content for the line. If getline is called without an argument, we can access the content of the line by using $0, $1, and $2.

For example:

$ seq 5 | awk 'BEGIN { getline; print "Read ahead first line", $0 } { print $0 }'
Read ahead first line 1
2
3
4
5

Filtering lines processed by awk with filter patterns

We can specify some conditions for lines to be processed. For example:

$ awk 'NR < 5'       # first four lines
$ awk 'NR==1,NR==4'  # first four lines
$ awk '/linux/'      # lines containing the pattern linux (we can specify regex)
$ awk '!/linux/'     # lines not containing the pattern linux

Setting delimiter for fields

By default, the delimiter for fields is a space. We can explicitly specify a delimiter by using -F "delimiter":

$ awk -F: '{ print $NF }' /etc/passwd

Or:

awk 'BEGIN { FS=":" } { print $NF }' /etc/passwd

We can set the output fields separator by setting OFS="delimiter" in the BEGIN block.
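
For example, the following sketch reads colon-separated input and writes comma-separated output (the sample line is made up):

$ echo "root:x:0" | awk 'BEGIN{ FS=":"; OFS="," } { print $1,$2,$3 }'
root,x,0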

Reading the command output from awk

We can read the output of an external shell command into a variable inside awk by using getline.

The syntax for reading the output of a command into a variable output is as follows:

"command" | getline output ;

For example:

$ echo | awk '{ "grep root /etc/passwd" | getline cmdout ; print cmdout }' root:x:0:0:root:/root:/bin/bash

Here, echo produces a single blank line so that the { } block runs once. By using getline, we read the output of the external command grep root /etc/passwd into the variable cmdout, and print then prints the line containing root.

awk supports associative arrays, which can use text as the index.

Using loop inside awk

A for loop is available in awk. It has the following format:

for(i=0;i<10;i++) { print $i ; }

Or:

for(i in array) { print array[i]; }
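
Combined with associative arrays, the for(i in array) form enables idioms such as counting occurrences. Here is a minimal sketch (the input words are arbitrary, and the output order of for..in is unspecified):

$ echo "apple orange apple" | awk '{ for(i=1;i<=NF;i++) count[$i]++ } END{ for(w in count) print w, count[w] }'
apple 2
orange 1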

String manipulation functions in awk

awk comes with many built-in string manipulation functions. Let's have a look at a few of them:

  • length(string): This returns the string length.
  • index(string, search_string): This returns the position at which search_string is found in the string.
  • split(string, array, delimiter): This stores the list of strings generated by using the delimiter in the array.
  • substr(string, start-position, length): This returns the substring created from the string, starting at the given character offset and spanning at most length characters.
  • sub(regex, replacement_str, string): This replaces the first occurring regular expression match in the string with replacement_str.
  • gsub(regex, replacement_str, string): This is similar to sub(), but it replaces every regular expression match.
  • match(regex, string): This returns the result of whether a regular expression (regex) match is found in the string or not. It returns a non-zero output if a match is found, otherwise it returns zero. Two special variables are associated with match(). They are RSTART and RLENGTH. The RSTART variable contains the position at which the regular expression match starts. The RLENGTH variable contains the length of the string matched by the regular expression.
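
A short sketch exercising a few of these functions on a made-up string:

$ echo "hello world" | awk '{ print length($0), index($0, "world"), substr($0, 1, 5) }'
11 7 hello
$ echo "hello world" | awk '{ gsub(/o/, "0", $0); print $0 }'
hell0 w0rld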

Using head and tail for printing the last or first 10 lines

When looking into a large file, which consists of thousands of lines, we will not use a command such as cat to print the entire file contents. Instead, we look at a sample (for example, the first 10 lines or the last 10 lines of the file). We may need to print the first n lines or the last n lines, or all the lines except the last n lines, or all the lines except the first n lines.

Another use case is to print lines from the mth to the nth line.

The commands head and tail can help us do this.

How to do it...

The head command always reads the header portion of the input file.

  1. Print the first 10 lines as follows:

    $ head file

  2. Read the data from stdin as follows:

    $ cat text | head

  3. Specify the number of first lines to be printed as follows:

    $ head -n 4 file

    This command prints four lines.

  4. Print all lines excluding the last M lines as follows:

    $ head -n -M file

    Note that it is negative M.

    For example, to print all the lines except the last five lines, use the following command line:

    $ seq 11 | head -n -5
    1
    2
    3
    4
    5
    6

    The following command will, however, print from 1 to 5:

    $ seq 100 | head -n 5

  5. Printing by excluding the last lines is a very important usage of head. Now, let us see how to print the last few lines. Print the last 10 lines of a file as follows:

    $ tail file

  6. In order to read from stdin, you can use the following command line:

    $ cat text | tail

  7. Print the last five lines as follows:

    $ tail -n 5 file

  8. In order to print all lines excluding the first M lines, use the following code:

    $ tail -n +(M+1)

For example, to print all lines except the first five lines, M + 1 = 6, therefore the command will be as follows:

$ seq 100 | tail -n +6

This will print from 6 to 100.
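
By combining both commands, we can also print lines from the mth to the nth line, as mentioned in the introduction; a minimal sketch for lines 10 to 20:

$ seq 100 | head -n 20 | tail -n 11

Here, head keeps the first 20 lines, and tail keeps the last 11 of those, which are lines 10 to 20.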

One of the important usages of tail is to read a constantly growing file. Since new lines are constantly appended to the end of the file, tail can be used to display all new lines as they are written to the file. When we run tail without specifying any options, it will read the last 10 lines and exit. However, by that time, new lines would have been appended to the file by a process. To constantly monitor the growth of file, tail has a special option -f or --follow, which enables tail to follow the appended lines and keep being updated as data is added:

$ tail -f growing_file

You will probably want to use this on logfiles. The command to monitor the growth of the files would be:

# tail -f /var/log/messages

Or:

$ dmesg | tail -f

We frequently run dmesg to look at kernel ring buffer messages, either to debug USB devices or to look at sdX (X is the minor number for the sd device corresponding to a SCSI disk). tail -f can also take a sleep interval with -s, so that we can set the interval at which file updates are monitored.

tail has the interesting property that allows it to terminate after a given process ID dies.

Suppose we are reading a growing file and a process Foo is appending data to it; then tail -f should be executed until the process Foo dies:

$ PID=$(pidof Foo)
$ tail -f file --pid $PID

When the process Foo terminates, tail also terminates.

Let us work on an example.

  1. Create a new file file.txt and open the file in gedit (you can use any text editor).
  2. Add new lines to the file and save it frequently in gedit.
  3. Now run the following commands:

    $ PID=$(pidof gedit)
    $ tail -f file.txt --pid $PID

When you make changes to the file and save them, the new lines will be written to the terminal by the tail command. When you close gedit, the tail command terminates.

Counting the number of lines, words, and characters in a file

Counting the number of lines, words, and characters in a text file is very useful for text manipulation. In several cases, these counts are used in indirect ways to perform some hacks in order to produce the required output patterns and results. This book includes some tricky examples in the other chapters. Counting LOC (Lines of Code) is a very important task for developers. We may need to count special types of files, excluding unnecessary files; a combination of wc with other commands helps to perform that.

wc is the utility used for counting. It stands for word count. Let us see how to use wc to count lines, words, and characters.

How to do it...

We can use various options for wc to count the number of lines, words, and characters:

  1. Count the number of lines in the following manner:

    $ wc -l file

  2. To use stdin as input, use the following command:

    $ cat file | wc -l

  3. Count the number of words as follows:

    $ wc -w file
    $ cat file | wc -w

  4. In order to count the number of characters, use the following commands:

    $ wc -c file
    $ cat file | wc -c

    For example, we can count the characters in a text as follows:

    $ echo -n 1234 | wc -c
    4

    -n is used to avoid an extra newline character.

  5. To print the number of lines, words, and characters, execute wc without any options:

    $ wc file
    1435 15763 112200

    Those are the number of lines, words, and characters respectively.

  6. Print the length of the longest line in a file using the -L option:

    $ wc file -L
    205
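
    As a sketch of the LOC use case mentioned in the introduction, wc can be combined with find to count the lines of all .c files under a directory (the file layout is hypothetical, and GNU find/xargs are assumed):

    $ find . -name "*.c" -print0 | xargs -0 wc -l

    Each output line shows a per-file count, followed by a total (for very large file sets, xargs may invoke wc more than once, producing multiple totals).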
