Data Wrangling

2022-10-17

Data Wrangling Intro

Data wrangling is the process of cleaning, restructuring,
and enriching data. It can turn, or map,
large amounts of raw data into a different format
that makes the data more useful for consumption
and analysis by organizing it better.

We've already seen some basic data wrangling before. Pretty much any time you use the `|` operator, you are performing some kind of data wrangling.
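As a small illustration of the point about `|`, here is a sketch of a pipeline that filters, extracts, and de-duplicates. The sample input and the variable names (`log_lines`, `logins`) are made up for this example.

```shell
# Hypothetical sample input: one "user event" pair per line.
log_lines='alice login
bob login
alice logout
carol login'

# Keep only the login events, take the user name (first field),
# then sort the names and drop duplicates.
logins=$(printf '%s\n' "$log_lines" | grep 'login$' | cut -d ' ' -f 1 | sort -u)
printf '%s\n' "$logins"
```

Each stage reads the previous stage's output on STDIN, so every step stays small and easy to swap out.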

Regular expressions

Regular expressions are common and useful enough that it’s worthwhile to take some time to understand how they work.

. means “any single character” except newline.
* zero or more of the preceding match.
+ one or more of the preceding match.
[a-zA-Z] any single letter (one character in the range a-z or A-Z).
(RX1|RX2) either something that matches RX1 or RX2.
^ the start of the line.
$ the end of the line.
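To see these building blocks in action, here is a minimal sketch using `grep -E` (extended regular expressions) on a made-up word list. The input and the variable names are assumptions for illustration only.

```shell
# Hypothetical sample input, one word per line.
words='cat
caat
ct
dog
Cat9'

# '.' matches exactly one character, so c.t matches "cat" but not "ct".
dot=$(printf '%s\n' "$words" | grep -E '^c.t$')
# '*' allows zero or more of the preceding 'a'.
star=$(printf '%s\n' "$words" | grep -E '^ca*t$')
# '+' requires at least one of the preceding 'a'.
plus=$(printf '%s\n' "$words" | grep -E '^ca+t$')
# [a-zA-Z]+ between ^ and $ keeps lines made only of letters.
letters=$(printf '%s\n' "$words" | grep -E '^[a-zA-Z]+$')
# (RX1|RX2) matches either alternative.
alt=$(printf '%s\n' "$words" | grep -E '^(cat|dog)$')

printf 'dot: %s\n' "$dot"
printf 'alt: %s\n' "$alt"
```

Without the `^` and `$` anchors, each pattern would match anywhere inside a line rather than the whole line.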

grep

Print lines that match patterns (see man grep).

Usage: grep [OPTION...] PATTERNS [FILE...]
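A short sketch of the usage line above, reading from STDIN instead of a FILE. The sample text is invented; `-i`, `-c`, `-v`, and `-n` are standard grep options.

```shell
# Hypothetical sample input.
text='Error: disk full
warning: low memory
error: network down
info: all good'

# -i ignores case; -c prints the number of matching lines instead of the lines.
error_count=$(printf '%s\n' "$text" | grep -i -c '^error')
# -v inverts the match; -n prefixes each printed line with its line number.
non_errors=$(printf '%s\n' "$text" | grep -i -v -n '^error')

printf '%s\n' "$error_count"
printf '%s\n' "$non_errors"
```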

sed

Stream editor for filtering and transforming text.
A stream editor is used to perform basic text transformations
on an input stream (a file or input from a pipeline).

s for substitution

The slash serves as the delimiter between the pattern and the replacement.

date
echo $(date) | sed 's/[0-9]*//g'
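The slash is only a convention: sed accepts other characters as the delimiter of the `s` command, which helps when the pattern itself contains slashes. A minimal sketch, with a made-up path as input:

```shell
# Hypothetical sample input.
path='/usr/local/bin'

# With '/' as the delimiter, every literal slash in the pattern must be escaped.
with_slash=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')
# With '|' as the delimiter, the same substitution needs no escaping.
with_pipe=$(printf '%s\n' "$path" | sed 's|/usr/local|/opt|')

printf '%s\n' "$with_slash" "$with_pipe"
```

Both commands perform the same substitution; the second is simply easier to read.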

Using & as the matched string

sed -r 's/[0-9]+/|&|/g' <(date)
sed 's/[a-zA-Z]*/& &/g' <(echo Hello)
# Hello Hello

Using \1 to keep part of the pattern

The \1 refers to the characters captured by the first pair of escaped parentheses; \2 refers to the second pair, and so on.

What does \1 in sed do?

echo 'abc1abc2abc3' | sed 's/\(ab\)c/|\1|/'
# |ab|1abc2abc3 
echo 'abc1abc2abc3' | sed 's/\(ab\)\(c\)/\2-\1/g'
# c-ab1c-ab2c-ab3

awk

awesome awk

Processing workflow

Every AWK execution consists of the following three phases:

BEGIN{...} are actions performed at the beginning, before the first text character is read.
[condition]{...} are actions performed on every awk record (by default, a line of text).
- every awk record is automatically split into awk fields.
END{...} are actions performed at the end of the execution, after the last text character is read.
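The three phases can be sketched with a tiny program. The input lines and the variable names (`awk_out`, `n`) are made up for this example.

```shell
awk_out=$(printf 'a\nbb\nccc\n' | awk '
BEGIN { print "start"; n = 0 }     # runs once, before any input is read
length($0) > 1 { n++; print $0 }   # runs for each record where the condition holds
END { print "matched " n }         # runs once, after the last record
')
printf '%s\n' "$awk_out"
```

Leaving out the condition makes the middle block run for every record.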

Global variables

$0 value of the current awk record (the whole line without the line break)
- $1, $2, … values of the first, second, … awk field.
FS specifies the input field separator, that is, how awk breaks an input record into fields (default: whitespace)
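Here is a minimal sketch of how FS changes the field splitting. The sample lines are invented; the `-F` option used below is the command-line way of setting FS.

```shell
# Hypothetical colon-separated record.
line='root:x:0:0:root:/root:/bin/bash'

# -F: sets FS to ":" so $1 and $7 are the first and seventh colon-separated fields.
fields=$(printf '%s\n' "$line" | awk -F: '{ print $1, $7 }')
# Setting FS in a BEGIN block has the same effect.
same=$(printf '%s\n' "$line" | awk 'BEGIN { FS = ":" } { print $1, $7 }')
# With the default FS, runs of spaces and tabs separate the fields.
default_split=$(printf 'one  two\tthree\n' | awk '{ print $2 }')

printf '%s\n' "$fields" "$same" "$default_split"
```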

Builtin functions

print, printf(), sprintf()
- printing and string formatting
length()
- length of a string argument
~
- regexp match (an operator rather than a function)
substr()
- extract a substring from a string
split()
- split a string into an array of strings
index()
- find the position of a substring in a string
sub() and gsub()
- regexp search and replace (once and globally, respectively)
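A quick sketch that exercises most of the functions listed above on a single made-up record. The variable names (`parts`, `once`, `all`) are illustrative only.

```shell
builtin_out=$(printf 'hello world\n' | awk '{
    print length($0)                # 11
    print substr($0, 1, 5)          # hello (5 characters starting at position 1)
    print index($0, "world")        # 7 (positions are 1-based)
    n = split($0, parts, " ")       # parts[1] = "hello", parts[2] = "world"
    print n, parts[2]               # 2 world
    once = $0; sub(/o/, "0", once)  # replaces only the first match
    all = $0; gsub(/o/, "0", all)   # replaces every match
    print once                      # hell0 world
    print all                       # hell0 w0rld
    if ($0 ~ /^hello/) print "match"
}')
printf '%s\n' "$builtin_out"
```

Note that sub() and gsub() modify their third argument in place, which is why the record is copied into a variable first.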

Example

ps | awk 'BEGIN{print "PID AWK record\n";cnt = 0} $1 ~ "[0-9]+"{printf("%d,%s\n",$1,$0);cnt += $1} END{printf("summary:%d\n",cnt)}'

Analyzing data

You can do math directly in your shell using bc, a calculator that can read from STDIN.

Example

ps | grep -se '[0-9]' | awk '{print $1}' | paste -sd + | bc -l

Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.