The Best Tools to Analyze Alternative Data | Parts 2 & 3: Ingesting and Loading Data
This article continues our series on data tools to analyze alternative data. It will cover Steps 2 & 3 of our list below. As a reminder, the remaining steps will be covered in successive articles over the following weeks.
Step-by-Step Process:
- Get Data: copy data from the vendor to where you want to analyze data
- Ingest Data: you may need to manipulate raw data for it to be loaded correctly
- Load Data: import data into the analysis package
- Preprocess Data: clean, filter and transform data for it to be ready for modeling
- Modeling: apply your analysis and draw conclusions
- Presenting: present your insights and conclusions in a digestible format
Why should I care? In Step 1 you obtained data from vendors for your analysis. The data may arrive in various formats, and you may need to manipulate the raw data before it will load correctly. Some data you can or may want to load directly into the analysis program, but if you regularly work with a dataset it makes sense to ingest it into a clean, common file format or database so you can quickly run your analysis on it.
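As a minimal sketch of that ingestion step, the snippet below normalizes a raw vendor extract (made-up data and file name) into a clean, common CSV using pandas:

```python
import io

import pandas as pd

# Hypothetical raw vendor extract: stray whitespace in headers and values
raw = io.StringIO(
    "Date , Ticker ,Close\n"
    "2018-07-02, AAPL ,187.18\n"
    "2018-07-03, AAPL ,183.92\n"
)

df = pd.read_csv(raw)
df.columns = df.columns.str.strip().str.lower()  # normalize column names
df["ticker"] = df["ticker"].str.strip()          # strip stray whitespace
df["date"] = pd.to_datetime(df["date"])          # parse dates properly
df.to_csv("prices_clean.csv", index=False)       # clean, common format
```

From here, repeated analyses can read the clean file directly instead of re-applying the fixes each time.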
What are pandas, dask and pyspark? How do I choose between them? These are core Python data analytics libraries. Each has pros and cons, mostly related to the size of the data you are analyzing. An overview of each is given below; in the upcoming preprocessing section we will go into more detail on how to use each of these libraries. Also note that there are graphical and commercial alternatives available for data preparation, data integration and data analysis.
Pandas
- Overview: THE Python data analysis library. If you’re just getting started, stick with this; it will take you a long way
- Pros: Widely used with lots of support and many extensions available
- Cons: Likely to break when you start analyzing files of 1GB or more, because it reads all data into your computer’s memory. It also uses only one CPU core by default
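A common workaround before moving off pandas is to read large files in chunks and aggregate as you go; a minimal sketch with made-up in-memory data:

```python
import io

import pandas as pd

# Simulate a large CSV with a toy in-memory file (made-up data)
big_csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(10_000)))

# Read and aggregate in chunks instead of loading everything at once
total = 0
for chunk in pd.read_csv(big_csv, chunksize=1_000):
    total += chunk["value"].sum()

print(total)  # 49995000, computed without holding the full file in memory
```

This keeps memory use bounded by the chunk size, though it only helps for computations you can express incrementally.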
Dask
- Overview: built on top of pandas to make it scale to large datasets. The next best thing after you’ve pushed pandas to its limits
- Pros: Good for analyzing large files and intermediate multi-core/distributed computing
- Cons: Missing some pandas functionality and works differently from pandas
Pyspark
- Overview: once you start analyzing really large (TB+) datasets, pyspark gives you fully distributed computing
- Pros: Scales really well but you can start off in standalone local mode. Well integrated into Hadoop ecosystem with good support
- Cons: Steeper learning curve and infrastructure setup. Works very differently from pandas
Should I load my data into a database? Pros:
- Everyone can access the data in a central place instead of having a bunch of files stored in all sorts of places
- They are optimized for querying data even with large datasets
- If you can load data into a database you’ve cleaned it well and understand the dataset
Cons:
- Depending on what database you use, SQL databases in particular are very “opinionated” and expect the data to be in a certain format before it can be loaded
- Can’t be (easily) branched if someone wants to experiment with new features
- Needs some access control; not everyone should have write access
This is a big and complex topic which depends on your database, operating system, open source vs commercial tools, etc. Below is a short and by no means comprehensive list of tools, with a focus on open source tools. Graphical tools:
- DB tools: SQuirrel, HeidiSQL, Mysql workbench, PgAdmin
- ETL Tools: Pentaho Community edition, KNIME
Command-line and code:
- Database commands: CSV files you can import directly, eg mysql LOAD DATA or postgres COPY. This doesn’t work for other file formats, so you’ll either need to convert to CSV (eg using d6tstack.convert_xls) or use something else
- Python libraries: analytics libraries support export to a database, eg pandas DataFrame.to_sql(), pyspark’s JDBC writer or odo. Some work better than others…
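As a minimal sketch of the pandas route, assuming a made-up frame and an in-memory SQLite database:

```python
import sqlite3

import pandas as pd

# Made-up sample data
df = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "close": [187.18, 98.63]})

# to_sql() accepts SQLAlchemy engines for other databases;
# sqlite3 connections are supported directly
con = sqlite3.connect(":memory:")
df.to_sql("prices", con, index=False, if_exists="replace")

rows = con.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(rows)  # 2
```

The same pattern works against MySQL or Postgres by passing a SQLAlchemy engine instead of the sqlite3 connection.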
What do I do if my data changes and won’t import into a database or doesn’t load correctly? SQL databases in particular are very “opinionated” and expect the data to be in a certain format; changes to the format are likely to cause import errors. Other tools like dask and pyspark also have trouble if the raw file format changes (see the dask example or pyspark example). You will likely have to inspect the changes and make some manual adjustments. Libraries like d6tstack can help you with that, and commercial data prep tools also help you deal with this problem.
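A sketch of such a manual adjustment in pandas, assuming a hypothetical schema change between two vendor files (a renamed column and a new one):

```python
import pandas as pd

# Two monthly vendor files whose schema changed: "px" was renamed to
# "close" and a "volume" column was added (made-up data)
old = pd.DataFrame({"date": ["2018-06-29"], "px": [185.11]})
new = pd.DataFrame({"date": ["2018-07-31"], "close": [190.29], "volume": [39e6]})

# Map the renamed column back to a common name, then align on the
# union of columns so the combined frame loads cleanly
old = old.rename(columns={"px": "close"})
combined = pd.concat([old, new], ignore_index=True, sort=False)

print(combined.columns.tolist())  # ['date', 'close', 'volume']
```

Rows from the old file simply get NaN in the new column, which the downstream load or preprocessing step can then handle explicitly.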
How can I make regular updates? The same notes and tools apply as for regularly downloading data, although the complexity increases noticeably to ensure the integrity of the data. As before, commercial data prep tools also help you deal with this problem.
How do I put my data in version control? Git has large file support, or you can try DVC. Commercial data science platforms also offer this.
PART 4: PREPROCESS DATA (early August)