Introduction:
The motivation for this post comes from the everyday data wrangling that happens in my research group. Scientific research involves handling a lot of data, whether it comes from the sensor array of a multi-million dollar Data Acquisition System (DAQ) or from a week-long numerical simulation on a High-Performance Computing (HPC) system. Some kind of post-processing is almost always required to make sense of the data. Scripting languages (Python, MATLAB, etc.) really shine at handling multiple formats (binary files, txt files, csv files, and so on) and post-processing them to produce meaningful (that’s what we are looking for, right?) results.
To simplify things, I am going to work on a big data set obtained from a sensor array. The file is available on my GitHub page and can be downloaded, along with the script, here.
The file is a 13.2 MB txt file containing 86000 lines and 8 columns of sensor data. The first column is the time column, going from 0 seconds to 43 seconds with a timestep of 0.0005 seconds. The task is to load this txt file and plot the different sensor data vs time with a basic Matplotlib 2D line plot. I want to preserve the column-oriented format and work on columns as a whole, similar to what Excel does with its column-wise data.
time s1 s2 s3 s4 s5 s6 s7
0.000000000000 11.762732505798 12.584483146667 11.091325759888 10.376955986023 5.015242576599 0.667141735554 0.852049231529
0.000500000000 12.026422500610 12.613877296448 11.282106399536 10.450716018677 4.941900253296 0.623193979263 0.881391227245
One of the things some people like about MATLAB is the automatic loading of different mathematics-oriented packages. In Python, most of the time you have to import numpy/scipy/pandas to do even the most basic data wrangling and plotting. Many of my colleagues feel that from y import x statements are extra steps and they want to get onto the job ASAP! To keep everyone happy, I am going to import the sensor data:
- Without importing any module
- Using NumPy
- Using Pandas
1. Without importing any module
Being so used to NumPy and Pandas, I found it a fun challenge to use the inbuilt data structures to import and work on the data.
The readlines method, which reads the lines of a text file, is perfect for our task.
file_name = "sensor_data.txt"
with open(file_name, 'r') as f:
    lines_of_data = f.readlines()[1:]
I defined the file name (file_name) in the first line and the script will search for that name in my working directory. It is important to keep the data file in the same working directory as the script. Obviously, you can keep the file in some other folder, but then you need to use the absolute path (something like c:/Hello/World/sensor_data.txt) or the relative path.
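For example (the alternative paths below are just illustrative placeholders):

file_name = "sensor_data.txt"                    # file sits next to the script
# file_name = "C:/Hello/World/sensor_data.txt"   # absolute path
# file_name = "../data/sensor_data.txt"          # relative path, one folder up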
For the second line, try to read it like simple English.
With open-ing the file_name as a ‘r’ead-only file using the name f.
Simple, isn’t it? Let’s go over it once again!
The second line tells Python to open the file as a file object under the name f. I will use this with open snippet often, as it is the standard and most common way to open files in Python.
This file object provides multiple functions/methods that are used to access and manipulate data files. One such function appears in the third line: readlines. The f.readlines method returns a list of lines in row-oriented format (1 line/row at a time!).
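To get back to the column-oriented view mentioned in the introduction, each line can be split into floats and the rows transposed with zip. Here is a minimal sketch using only built-ins (the names rows and columns are just illustrative):

# Convert each row of text into a list of floats
rows = [[float(value) for value in line.split()] for line in lines_of_data]
# Transpose: zip(*rows) regroups the rows into columns
columns = list(zip(*rows))
time, s1 = columns[0], columns[1]
print(time[0], s1[0])   # 0.0 11.762732505798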
2. Using NumPy
One of my favourite libraries, NumPy adds MATLAB-like support for large multi-dimensional arrays to Python. It also contains many mathematical functions to work on these arrays, along with “better” file handling. I am going to use two different file-reading methods to import the sensor data file using NumPy.
a) Using loadtxt
import numpy as np
data_matrix = np.loadtxt(file_name, dtype='float', skiprows=1)
data_matrix = data_matrix.T
NumPy’s loadtxt is the simplest way to import the file, and its arguments give the user some control over how the data is read.
- The dtype argument changes the data type of all the values in the file to float.
- Most of the time, the top few lines contain some test information: test name, test date, column names, etc. In our file, the top line contains the column names ‘time s1 s2 s3 s4 s5 s6 s7’. The skiprows argument allows us to skip this informational line and start reading the actual data from the second line.
- Taking the transpose using data_matrix.T converts the row-wise matrix into a column-wise matrix, as sketched below.
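With the transpose in place, each row of data_matrix is one column of the original file, so the plotting task from the introduction takes only a few lines. A minimal sketch, assuming Matplotlib is installed:

import matplotlib.pyplot as plt

time = data_matrix[0]                              # first column of the file
for i, sensor in enumerate(data_matrix[1:], start=1):
    plt.plot(time, sensor, label='s{}'.format(i))  # one line per sensor
plt.xlabel('time (s)')
plt.legend()
plt.show()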
b) Using genfromtxt
>>> import numpy as np
>>> data_matrix_2 = np.genfromtxt(file_name, dtype='float', names=True)
>>> print(data_matrix_2['time'])
One line to rule them all?
genfromtxt gets the job done in a single line, and it has one very important argument.
The names argument looks at the first line of the file and uses it to populate the column names. In our file, the first line contains ‘time s1 s2 s3 s4 s5 s6 s7’, and it can be used to name the columns of the imported data. The columns can then be accessed by name, as in the third line with data_matrix_2['time']. Accessing columns by their names makes manipulation and plotting of data super easy and intuitive, especially for seasoned Excel users.
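As a quick check, the recovered column names live on the array’s dtype, and a named column can go straight into a plot. A minimal sketch, assuming matplotlib.pyplot is already imported as plt:

>>> data_matrix_2.dtype.names
('time', 's1', 's2', 's3', 's4', 's5', 's6', 's7')
>>> plt.plot(data_matrix_2['time'], data_matrix_2['s1'])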
3. Using Pandas
Cute name for such an awesome library! (The name comes from PANel DAta)
Pandas has a super powerful DataFrame object which takes imported data and populates the rows and columns similar to Excel/SPSS. This DataFrame object is similar to, and motivated by, R’s fundamental data structure, the data frame. Pandas has so many fancy methods for data alignment, reshaping, slicing, indexing, merging, pivoting, etc., which makes working on big datasets such a breeze.
>>> import pandas as pd
>>> df = pd.read_csv(file_name,
...                  delim_whitespace=True,
...                  # names=('time', 's1', 's2', 's3', 's4', 's5', 's6', 's7'),
...                  dtype={'time': 'float', 's1': 'float', 's2': 'float',
...                         's3': 'float', 's4': 'float', 's5': 'float',
...                         's6': 'float', 's7': 'float'})
The read_csv method is the standard way to import data from a wide variety of formats. read_csv has about 49 arguments, which give the user superb control over the imported data. I use these few arguments in my example:
- file_name: the file name (or path) of the sensor data set.
- delim_whitespace: Our sensor data is separated by single spaces, so I use this boolean argument to tell read_csv about it. If your file uses a different separator or delimiter, pass it via the sep (or delimiter) argument instead.
- names: This argument populates the names for your columns. I have commented the line out because our file already has a top line containing column headers. If your file doesn’t have a column header, you can uncomment the line and pass your headers here.
- dtype: Assigns the datatype to the data in various columns.
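Once the DataFrame is built, you get Excel-like access by column name plus pandas’ built-in plotting. A minimal sketch of what you might do next (all standard DataFrame methods):

>>> df.head()                          # first five rows, spreadsheet-style
>>> df['s1'].mean()                    # column-wise statistics come for free
>>> df.plot(x='time', y=['s1', 's2'])  # quick line plot straight from the DataFrame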
Once you have imported your data using any of the methods above, you can start working on data manipulation and plotting. More on that is covered in the next posts.
Hope this post helped you in importing your awesome data!