This article continues our series on data tools to analyze alternative data. It will cover Steps 2 & 3 of our list below. As a reminder, the remaining steps will be covered in successive articles over the following weeks.

Step-by-Step Process:
  1. Get Data: copy data from the vendor to where you want to analyze data
  2. Ingest Data: you may need to manipulate raw data for it to be loaded correctly
  3. Load Data: import data into the analysis package
  4. Preprocess Data: clean, filter and transform data for it to be ready for modeling
  5. Modeling: apply your analysis and draw conclusions
  6. Presenting: present your insights and conclusions in a digestible format
PARTS 2 & 3: INGEST & LOAD DATA
Pandas / Dask / PySpark
Description: Read CSV, text and Excel files directly without ingesting
Notes & Tips:
  • If you have CSV, text or Excel files, you can read them directly without having to ingest them (a short pandas sketch follows the table)
  • Reading those files directly is easy, but you might run into problems if the data changes between files; use d6tstack to make sure your data is read correctly
  • Only pandas can read Excel files directly, and not very well; you might want to convert Excel to CSV using d6tstack.convert_xls
Cost: $0

d6tstack
Description: Quickly ingest any type of CSV or Excel data
Notes & Tips:
  • Vendors often send data in multiple files and frequently add or rename columns; this library helps you clean files for loading, see examples
Cost: $0

Postgres / MySQL / Others
Description: Database for storing data
Notes & Tips:
  • Optionally you can ingest raw files into a database
  • There’s some overhead involved in setting this up, but it’s worthwhile for large datasets or if you want to share data between people
  • If a vendor is nice they might give you SQL queries to ingest the data
Cost: $0

SQuirreL / HeidiSQL
Description: Manage your database using a GUI
Notes & Tips:
  • Graphical tools for managing your database instead of using the command line
Cost: $0
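
For illustration, here is a minimal sketch of reading vendor files directly with pandas. The file names are placeholders, and reading Excel assumes an Excel engine such as openpyxl is installed:

```python
import pandas as pd

# read a vendor CSV directly into memory (file name is a placeholder)
df_csv = pd.read_csv('vendor_data_2019-06.csv')

# pandas can also read Excel, though it needs an extra engine (eg openpyxl)
# and is noticeably slower than reading CSV
df_xls = pd.read_excel('vendor_data_2019-06.xlsx', sheet_name=0)

print(df_csv.shape, df_xls.shape)
```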


Why should I care? In Step 1 you obtained data from vendors for your analysis. That data may arrive in various formats, and you may need to manipulate the raw files before they load correctly. Some data you can (or want to) load directly into the analysis program, but if you work with a dataset regularly it makes sense to ingest it into a clean, common file format or a database so you can quickly run your analysis on it.

What are pandas, dask and pyspark? How do I choose between them? They are core Python data analytics libraries. Each has pros and cons, mostly related to the size of the data you are analyzing. An overview of each is given below, with a short read example after the list; the upcoming preprocessing section will go into more detail on how to use each library. Also note that there are graphical and commercial alternatives available for data preparation, data integration and data analysis.

Pandas
  • Overview: THE Python data analysis library. If you’re just getting started, stick with this; it will take you a long way
  • Pros: Widely used, with lots of support and many extensions available
  • Cons: Likely to break when you start analyzing >=1GB files because it wants to read all the data into your computer’s memory. It also uses only one CPU core by default
Dask
  • Overview: built on top of pandas to make it scale to large datasets. The next best thing after you’ve pushed pandas to its limits
  • Pros: Good for analyzing large files and intermediate multi-core/distributed computing
  • Cons: Missing some pandas functionality and works differently from pandas
PySpark
  • Overview: once you start analyzing really large (TB+) datasets, PySpark gives you fully distributed computing
  • Pros: Scales really well, but you can start off in standalone local mode. Well integrated into the Hadoop ecosystem with good support
  • Cons: Steeper learning curve and infrastructure setup. Works very differently from pandas
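
To make the differences concrete, here is a minimal sketch of the same read in each library. The file paths and the 'ticker' column are placeholders, and the PySpark part assumes a local Spark installation:

```python
# pandas: loads everything into memory on a single core
import pandas as pd
df = pd.read_csv('trades.csv')

# dask: similar API, but lazy and chunked, so it can handle data larger than RAM
import dask.dataframe as dd
ddf = dd.read_csv('trades-*.csv')                 # can glob many files at once
result = ddf.groupby('ticker').size().compute()   # nothing runs until .compute()

# pyspark: fully distributed; here in standalone local mode
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').getOrCreate()
sdf = spark.read.csv('trades.csv', header=True, inferSchema=True)
sdf.groupBy('ticker').count().show()
```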
What are the pros/cons of ingesting data into a database?  

Pros:
  • Everyone can access the data in a central place instead of having a bunch of files stored in all sorts of places
  • They are optimized for querying data even with large datasets
  • If you can load data into a database you’ve cleaned it well and understand the dataset
Cons:
  • Databases, SQL databases in particular, are very “opinionated” and expect the data to be in a certain format before it can be loaded
  • Can’t be (easily) branched if someone wants to experiment with new features
  • Needs some access control; not everyone should have write access
How do I ingest data into a database?
This is a big and complex topic which depends on your database, operating system, open source vs commercial, etc. Below is a short and by no means comprehensive list of tools, with a focus on open source. Graphical tools: see SQuirreL / HeidiSQL in the table above. Programmatic (a minimal sketch follows the list):
  • Database commands: CSV files can be imported directly, eg mysql LOAD DATA or postgres COPY. This doesn’t work for other file formats, so you’ll either need to convert to CSV (eg using d6tstack.convert_xls) or use something else
  • Python libraries: the analytics libraries support exporting to a database, eg pandas to_sql(), pyspark JDBC or odo. Some work better than others…
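
As a sketch of the programmatic route, assuming a local Postgres database with the psycopg2 driver installed; the connection string, file name and table name are all hypothetical:

```python
import pandas as pd
import sqlalchemy

# hypothetical connection string and table name; adjust for your setup
engine = sqlalchemy.create_engine('postgresql://user:password@localhost/altdata')

df = pd.read_csv('vendor_data_2019-06.csv')

# append the file into a table, letting pandas create the table if it doesn't exist
df.to_sql('vendor_prices', engine, if_exists='append', index=False)
```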

What do I do if my data changes and won’t import into a database or doesn’t load correctly? SQL databases in particular are very “opinionated” and expect the data to be in a certain format, so changes to the format will likely cause import errors. Other tools like dask and pyspark also have trouble if the raw file format changes (see the dask example or pyspark example). You will likely have to inspect the changes and make some manual adjustments, as in the sketch below. Libraries like d6tstack can help you with that, and commercial data prep tools also help you deal with this problem.
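
As an illustration of inspecting a schema change manually with pandas (the file names are hypothetical; d6tstack automates this kind of check):

```python
import glob
import pandas as pd

# hypothetical monthly vendor files where a column was added or renamed partway through
files = sorted(glob.glob('vendor_2019-*.csv'))
dfs = [pd.read_csv(f) for f in files]

# see which columns appear in which file before combining
for f, df in zip(files, dfs):
    print(f, sorted(df.columns))

# pd.concat aligns on column names and fills missing columns with NaN,
# so schema drift shows up as NaN-heavy columns instead of silently misaligned data
combined = pd.concat(dfs, ignore_index=True, sort=False)
print(combined.isna().sum())
```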

How can I make regular updates? The same notes and tools apply as for regularly downloading data, although the complexity increases noticeably to ensure the integrity of the data. As before, commercial data prep tools also help you deal with this problem.

How do I put my data in version control? Git has large file support, or you can try DVC. Commercial data science platforms also do this.

PART 4: PREPROCESS DATA (early August)