The Best Tools to Analyze Alternative Data | Parts 2 & 3: Ingesting and Loading Data
This article continues our series on data tools to analyze alternative data. It will cover Steps 2 & 3 of our list below. As a reminder, the remaining steps will be covered in successive articles over the following weeks.
Step-by-Step Process:
- Get Data: copy data from the vendor to where you want to analyze data
- Ingest Data: you may need to manipulate raw data for it to be loaded correctly
- Load Data: import data into the analysis package
- Preprocess Data: clean, filter and transform data for it to be ready for modeling
- Modeling: apply your analysis and draw conclusions
- Presenting: present your insights and conclusions in a digestible format
Why should I care? In Step 1 you obtained data from vendors for your analysis. The data may arrive in various formats, and you may need to manipulate the raw data before it will load correctly. Some data you can or may want to load directly into the analysis program, but if you regularly work with a dataset it makes sense to ingest it into a clean, common file format or database so you can quickly run your analysis on it.
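As a minimal sketch of that ingestion step, the snippet below normalizes a raw vendor extract (made-up data and file name) into a clean, common CSV using pandas:

```python
import io

import pandas as pd

# Hypothetical raw vendor extract: stray whitespace in headers and values
raw = io.StringIO(
    "Date , Ticker ,Close\n"
    "2018-07-02, AAPL ,187.18\n"
    "2018-07-03, AAPL ,183.92\n"
)

df = pd.read_csv(raw)
df.columns = df.columns.str.strip().str.lower()  # normalize column names
df["ticker"] = df["ticker"].str.strip()          # strip stray whitespace
df["date"] = pd.to_datetime(df["date"])          # parse dates properly
df.to_csv("prices_clean.csv", index=False)       # clean, common format
```

From here, repeated analyses can read the clean file directly instead of re-applying the fixes each time.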
What are pandas, dask and pyspark? How do I choose between them? These are core Python data analytics libraries. Each has pros and cons, mostly related to the size of the data you are analyzing. An overview of each is given below; in the upcoming preprocessing section we will go into more detail on how to use each of these libraries. Also note that there are graphical and commercial alternatives available for data preparation, data integration and data analysis.
Pandas
- Overview: THE Python data analysis library. If you’re just getting started, stick with this; it will take you a long way
- Pros: Widely used with lots of support and many extensions available
- Cons: Likely to break when you start analyzing files of 1GB or more, because it reads all data into your computer’s memory. It also uses only one CPU core by default
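A common workaround before moving off pandas is to read large files in chunks and aggregate as you go; a minimal sketch with made-up in-memory data:

```python
import io

import pandas as pd

# Simulate a large CSV with a toy in-memory file (made-up data)
big_csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(10_000)))

# Read and aggregate in chunks instead of loading everything at once
total = 0
for chunk in pd.read_csv(big_csv, chunksize=1_000):
    total += chunk["value"].sum()

print(total)  # 49995000, computed without holding the full file in memory
```

This keeps memory use bounded by the chunk size, though it only helps for computations you can express incrementally.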
Dask
- Overview: built on top of pandas to make it scale to large datasets. The next best thing after you’ve pushed pandas to its limits
- Pros: Good for analyzing large files and intermediate multi-core/distributed computing
- Cons: Missing some pandas functionality and works differently from pandas
Pyspark
- Overview: once you start analyzing really large (TB+) datasets, pyspark gives you fully distributed computing
- Pros: Scales really well but you can start off in standalone local mode. Well integrated into Hadoop ecosystem with good support
- Cons: Steeper learning curve and infrastructure setup. Works very differently from pandas
Should I load my data into a database? Pros:
- Everyone can access the data in a central place instead of having a bunch of files stored in all sorts of places
- They are optimized for querying data even with large datasets
- If you can load data into a database you’ve cleaned it well and understand the dataset
Cons:
- Depending on what database you use, SQL databases in particular are very “opinionated” and expect the data to be in a certain format before it can be loaded
- Can’t be (easily) branched if someone wants to experiment with new features
- Needs some access control; not everyone should have write access
This is a big and complex topic which depends on your database, operating system, open source vs commercial tools, etc. Below is a short and by no means comprehensive list of tools, with a focus on open source tools. Graphical tools:
- DB tools: SQuirrel, HeidiSQL, Mysql workbench, PgAdmin
- ETL Tools: Pentaho Community edition, KNIME
Command-line and code:
- Database commands: CSV files you can import directly, eg mysql LOAD DATA or postgres COPY. This doesn’t work for other file formats, so you’ll either need to convert to CSV (eg using d6tstack.convert_xls) or use something else
- Python libraries: analytics libraries support export to a database, eg pandas DataFrame.to_sql(), pyspark’s JDBC writer or odo. Some work better than others…
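As a minimal sketch of the pandas route, assuming a made-up frame and an in-memory SQLite database:

```python
import sqlite3

import pandas as pd

# Made-up sample data
df = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "close": [187.18, 98.63]})

# to_sql() accepts SQLAlchemy engines for other databases;
# sqlite3 connections are supported directly
con = sqlite3.connect(":memory:")
df.to_sql("prices", con, index=False, if_exists="replace")

rows = con.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(rows)  # 2
```

The same pattern works against MySQL or Postgres by passing a SQLAlchemy engine instead of the sqlite3 connection.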
What do I do if my data changes and won’t import into a database or doesn’t load correctly? SQL databases in particular are very “opinionated” and expect the data to be in a certain format; changes to the format are likely to cause import errors. Other tools like dask and pyspark also have trouble if the raw file format changes (see the dask example or pyspark example). You will likely have to inspect the changes and make some manual adjustments. Libraries like d6tstack can help you with that, and commercial data prep tools also help you deal with this problem.
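A sketch of such a manual adjustment in pandas, assuming a hypothetical schema change between two vendor files (a renamed column and a new one):

```python
import pandas as pd

# Two monthly vendor files whose schema changed: "px" was renamed to
# "close" and a "volume" column was added (made-up data)
old = pd.DataFrame({"date": ["2018-06-29"], "px": [185.11]})
new = pd.DataFrame({"date": ["2018-07-31"], "close": [190.29], "volume": [39e6]})

# Map the renamed column back to a common name, then align on the
# union of columns so the combined frame loads cleanly
old = old.rename(columns={"px": "close"})
combined = pd.concat([old, new], ignore_index=True, sort=False)

print(combined.columns.tolist())  # ['date', 'close', 'volume']
```

Rows from the old file simply get NaN in the new column, which the downstream load or preprocessing step can then handle explicitly.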
How can I make regular updates? The same notes and tools apply as for regularly downloading data, although the complexity increases noticeably to ensure the integrity of the data. As before, commercial data prep tools also help you deal with this problem.
How do I put my data in version control? Git has large file support, or you can try DVC. Commercial data science platforms also offer this.
PART 4: PREPROCESS DATA (early August)