Data Wrangling

Data Wrangling Intro

Data wrangling is the process of cleaning, restructuring,
and enriching data. It can turn, or map,
large amounts of raw data into a different format
that makes the data more useful for consumption
and analysis by organizing it better.

We've already seen some basic data wrangling earlier. Pretty much any time you use the `|` operator, you are performing some kind of data wrangling.
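As a small illustration of wrangling with pipes, the sketch below finds the most frequent word in a made-up list. The sample words and variable names are invented for this example.

```shell
# Hypothetical sample input: one word per line (not from the original post).
words='apple
banana
apple
cherry
banana
apple'

# sort groups identical lines, uniq -c counts each group,
# sort -rn orders by count (descending), head -n 1 keeps the top entry,
# and awk swaps the two columns into "word count" order.
top=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2 " " $1}')
echo "$top"   # apple 3
```

Each stage only reshapes the text it receives, which is why the stages can be swapped or extended without touching the others.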

Regular expressions

Regular expressions are common and useful enough that it’s worthwhile to take some time to understand how they work.

  • . matches any single character except newline.
  • * matches zero or more of the preceding match.
  • + matches one or more of the preceding match.
  • [a-zA-Z] matches any single letter (one character from a-z or A-Z).
  • (RX1|RX2) matches either something that matches RX1 or RX2.
  • ^ matches the start of the line.
  • $ matches the end of the line.
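The operators above can be tried out with grep. This is a minimal sketch on invented sample lines; it uses grep -c to count matching lines, and -E where the pattern needs + or (..|..).

```shell
# Hypothetical sample lines, invented for this illustration.
lines='cat
cot
ct
caat
dog
cat and dog'

# . matches exactly one character: cat, cot and "cat and dog" match; ct and caat do not.
dot=$(printf '%s\n' "$lines" | grep -c 'c.t')
# a* is zero or more a characters: ct, cat and caat match as whole lines.
star=$(printf '%s\n' "$lines" | grep -c '^ca*t$')
# a+ is one or more a characters (extended syntax): cat and caat match, ct does not.
plus=$(printf '%s\n' "$lines" | grep -Ec '^ca+t$')
# (RX1|RX2) between ^ and $: the whole line is either cat or dog.
alt=$(printf '%s\n' "$lines" | grep -Ec '^(cat|dog)$')
# [a-zA-Z] is one letter, so ^[a-zA-Z]+$ is a line made only of letters.
letters=$(printf '%s\n' "$lines" | grep -Ec '^[a-zA-Z]+$')

echo "$dot $star $plus $alt $letters"   # 3 3 2 2 5
```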

grep

Print lines that match patterns (man grep).

Usage: grep [OPTION...] PATTERNS [FILE...]
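A few commonly used options, sketched on invented log lines. The options shown (-i, -c, -v, -n) are standard grep options; the sample text is hypothetical.

```shell
# Hypothetical log lines, invented for this illustration.
log='INFO start
error disk full
ERROR network down
INFO done'

# -i ignores case; -c prints the number of matching lines instead of the lines.
errors=$(printf '%s\n' "$log" | grep -ic 'error')
# -v inverts the match; -n prefixes each printed line with its line number.
others=$(printf '%s\n' "$log" | grep -ivn 'error')

echo "$errors"   # 2
echo "$others"   # 1:INFO start, then 4:INFO done
```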

sed

Stream editor for filtering and transforming text.
A stream editor is used to perform basic text transformations
on an input stream (a file or input from a pipeline).

s for substitution

The slash as a delimiter.

# Show the date, then show it again with every run of digits deleted
date
echo $(date) | sed 's/[0-9]*//g'
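The slash is only a convention: sed takes whatever character follows s as the delimiter. The sketch below, on made-up values, shows why that helps with paths and what the trailing g flag changes.

```shell
path='/usr/local/bin'   # hypothetical sample value

# With / as the delimiter, every / inside the pattern must be escaped.
a=$(echo "$path" | sed 's/\/usr\/local/\/opt/')
# Another delimiter such as | avoids the escaping entirely.
b=$(echo "$path" | sed 's|/usr/local|/opt|')

# Without the g flag only the first match on each line is replaced.
c=$(echo 'a1b22c333' | sed 's/[0-9][0-9]*/#/')
# With g every match on the line is replaced.
d=$(echo 'a1b22c333' | sed 's/[0-9][0-9]*/#/g')

echo "$a $b $c $d"   # /opt/bin /opt/bin a#b22c333 a#b#c#
```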

Using & as the matched string

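# Wrap every run of digits in the date output in | characters
sed -r 's/[0-9]+/|&|/g' <(date)
# & stands for the whole matched string, so the word is doubled
sed 's/[a-zA-Z]*/& &/g' <(echo Hello)
# Hello Hello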

Using \1 to keep part of the pattern

The \1 refers to the characters captured by the first pair of escaped parentheses, \( and \); \2 refers to the second pair, and so on.

What does \1 in sed do?

echo 'abc1abc2abc3' | sed 's/\(ab\)c/|\1|/'
# |ab|1abc2abc3
echo 'abc1abc2abc3' | sed 's/\(ab\)\(c\)/\2-\1/g'
# c-ab1c-ab2c-ab3
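A typical use of captured groups is reordering parts of a line. The sketch below swaps a made-up "last,first" name, once with escaped parentheses and once with sed -E, where plain parentheses form the groups.

```shell
# Hypothetical "last,first" input, invented for this illustration.
name='Lovelace,Ada'

# Basic syntax: groups are written \( ... \).
basic=$(echo "$name" | sed 's/\([^,]*\),\(.*\)/\2 \1/')
# Extended syntax (-E): plain parentheses form the groups; \1 and \2 work the same way.
extended=$(echo "$name" | sed -E 's/([^,]*),(.*)/\2 \1/')

echo "$basic"      # Ada Lovelace
echo "$extended"   # Ada Lovelace
```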

awk

awesome awk

Processing workflow

Every AWK execution consists of the following three phases:

  • BEGIN{...} actions run once at the beginning, before the first input character is read.
  • [condition]{...} actions run for every awk record (by default, a line of text) that satisfies the optional condition.
    • Every awk record is automatically split into awk fields.
  • END{...} actions run once at the end of the execution, after the last input character is read.
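The three phases can be seen side by side in a short program. This is a sketch on invented name/score lines; the threshold of 60 and the counter n are arbitrary choices for the example.

```shell
# Hypothetical input: a name and a score per line.
scores='alice 90
bob 45
carol 72'

report=$(printf '%s\n' "$scores" | awk '
BEGIN    { print "passed:" }            # runs once, before any input is read
$2 >= 60 { print "  " $1; n++ }         # runs for each record whose condition holds
END      { print "total passed: " n }   # runs once, after the last record
')
echo "$report"
```

The output lists alice and carol under "passed:" and ends with "total passed: 2"; bob is skipped because the condition is false for that record.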


Global variables

  • $0 is the value of the current awk record (the whole line, without the line break).
    • $1, $2, … are the values of the first, second, … awk fields.
  • FS specifies the input awk field separator, i.e. how awk breaks an input record into fields (default: a single space, which splits on runs of whitespace).
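A brief sketch of FS in action, on a made-up colon-separated record in the style of /etc/passwd. NF (the number of fields in the current record) is a standard awk variable that is not in the list above and is used here only to show the effect of the split.

```shell
# Hypothetical colon-separated record, invented for this illustration.
record='alice:x:1000:1000:Alice:/home/alice:/bin/bash'

# -F: sets FS to ":"; $1 is the first field and $7 the seventh.
fields=$(echo "$record" | awk -F: '{ print $1 " uses " $7 }')
# Setting FS in BEGIN is equivalent; $0 still holds the whole record.
whole=$(echo "$record" | awk 'BEGIN { FS = ":" } { print NF ": " $0 }')

echo "$fields"   # alice uses /bin/bash
echo "$whole"    # 7: alice:x:1000:1000:Alice:/home/alice:/bin/bash
```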

Builtin functions

  • print, printf(), sprintf()
    • printing and formatting functions (sprintf() returns the formatted string instead of printing it)
  • length()
    • length of a string argument
  • ~
    • regexp match (an operator rather than a function)
  • substr()
    • extract a substring from a string
  • split()
    • split a string into an array of strings
  • index()
    • find the position of a substring in a string
  • sub() and gsub()
    • (regexp) search and replace: sub() replaces the first match, gsub() replaces all matches
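The functions above can be exercised on a single input line. This is a sketch with an invented sample string; the variable names parts, n and s are arbitrary.

```shell
out=$(echo 'hello world' | awk '{
    print length($0)                  # 11
    print substr($0, 1, 5)            # hello
    print index($0, "world")          # 7
    n = split($0, parts, " ")         # n is the number of pieces
    print n, parts[2]                 # 2 world
    if ($0 ~ /^hello/) print "match"  # ~ tests a regexp match
    s = $0
    sub(/o/, "0", s)                  # replace the first o only
    print s                           # hell0 world
    gsub(/o/, "0", s)                 # replace every remaining o
    print s                           # hell0 w0rld
    printf("%s has %d chars\n", $1, length($1))
}')
echo "$out"
```

Note that sub() and gsub() modify their third argument in place and return the number of replacements, which is why the result is read from s afterwards.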

Example

ps | awk 'BEGIN{print "PID     AWK record\n";cnt = 0} $1 ~ "[0-9]+"{printf("%d,%s\n",$1,$0);cnt += $1} END{printf("summary:%d\n",cnt)}'

Analyzing data

You can do math directly in your shell using bc, a calculator that can read from STDIN.

Example

ps | grep -se '[0-9]' | awk '{print $1}' | paste -sd + | bc -l

  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyright © 2022-2023 Ataraxia
