Unlike buying datasets and hiring people, the technology infrastructure required to analyze alternative data can be acquired at minimal to zero cost. We have put together a list of all the free tools you need to analyze alternative data without having to spend a lot. These resources can be used individually to perform a particular function or together to form a comprehensive technology stack to operate your entire data effort.

The list walks you through the alternative data analysis process step by step. For each step, it gives the key tools you need to complete that step, including links to their documentation. We will publish each step in a succession of articles. The final product will give you the resources needed to analyze alternative data at a cost of $0!

Step-by-Step Process:
  1. Get Data: copy data from the vendor to where you want to analyze data
  2. Ingest Data: you may need to manipulate raw data for it to be loaded correctly
  3. Load Data: import data into the analysis package
  4. Preprocess Data: clean, filter and transform data for it to be ready for modeling
  5. Modeling: apply your analysis and draw conclusions
  6. Presenting: present your insights and conclusions in a digestible format
TEACH-IN:  Manipulating Alternative Data with Python
Augvest & AlternativeData.org

Taught by Norman Niemer, Chief Data Scientist at UBS Asset Management (QED). Hands-on technical session on using python for ingesting, preprocessing and analyzing data. Will also address common problems encountered with alternative data like data schema changes and mismatched identifiers between vendors. Designed for both beginners as well as experienced coders.
  • Thursday, July 19, 2018  |  6:00pm
  • New York City
  • RSVP HERE
Part 1: Get Data
Goal                             Tool                       Notes/Tips                              Cost
AWS S3 file download             S3 Browser (graphical)                                             $0
                                 AWS CLI (programmatic)
Vendor API data download         Postman (graphical)                                                $0
                                 Requests (programmatic)
FTP file download                WinSCP (graphical)                                                 $0
                                 Pyftpsync (programmatic)
Corporate proxy authentication   Cntlm                      Makes your life easier if your          $0
                                                            corporate proxy requires authentication


Why should I care?
So you’ve found a dataset on AlternativeData.org, the vendor gave you access and you are ready to leverage alternative data! The very first step to analyzing alternative data is to get your hands on the data. Unless you’re working with small sample data that fits in an email, you will likely need to use one of these tools. With them, you will be able to obtain the raw data needed to run your analysis.

I’m a total beginner! How do I learn python and how do I install it?
If you’ve never used python before, we recommend you install Anaconda python and learn to code in python using one of the many excellent free or affordable resources available.

Note: you can get by without python for "getting data", but you will definitely need python later on when you want to process and analyze data.

How do I know which tool to use?
Which tool you need depends on where the data is located. The vendor will give you access credentials that typically look like one of the following:

     AWS S3 Files
     access key: AKIAIOSFODNN7EXAMPLE
     secret key: wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY
     Bucket: s3://nameofbucket/folder/

     Vendor API
     Endpoint: http://vendor.com/api/v1
     Username: usr
     Password: pwd

     FTP
     IP/Endpoint: ftp.vendor.com
     Folder: /data/
     Username: usr
     Password: pwd

Note: these cover only the most common scenarios. For other cases, you should consult your vendor.
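For the AWS S3 case, the programmatic download can be sketched in python. This is a minimal sketch, not a vendor-endorsed method: it assumes the third-party boto3 package (the AWS SDK for python, which is not in the tool list above) is installed, and the function names and the "data" target folder are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix/' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_s3_folder(uri, access_key, secret_key, target_dir="data"):
    """Download every object under the URI's prefix into target_dir."""
    # Assumption: boto3 is installed. It is imported here so that the
    # helper above still works without it.
    import boto3

    bucket, prefix = split_s3_uri(uri)
    client = boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Listings are paginated, so walk every page of results.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip "folder" placeholder objects
            # Files are saved flat under target_dir; objects in different
            # sub-folders that share a file name would overwrite each other.
            client.download_file(bucket, key, str(target / Path(key).name))
```

With the sample credentials above, the call would look like download_s3_folder("s3://nameofbucket/folder/", access_key, secret_key), with the two keys supplied by your vendor.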

I’ve never used any of those tools, how do I get started?
Follow the instructions and documentation at the links in the list. Where available, there is a quickstart link in the notes. Also, attend the Teach-In on July 19th.
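To give a feel for what a programmatic download looks like, here is a minimal sketch for the vendor API case. It makes two assumptions that may not hold for your vendor: that the endpoint accepts HTTP Basic authentication with the username and password, and that the response body is the file you want. It uses python's built-in urllib rather than Requests so that it runs without installing anything; the function names are hypothetical.

```python
import base64
import shutil
import urllib.request


def build_request(endpoint, username, password):
    """Create a request that carries HTTP Basic credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    request = urllib.request.Request(endpoint)
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


def download(endpoint, username, password, filename, timeout=30):
    """Save the response body of the endpoint to a local file."""
    request = build_request(endpoint, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(filename, "wb") as out:
            # Stream in chunks so large files do not have to fit in memory.
            shutil.copyfileobj(response, out)
```

With the sample credentials above, the call would be download("http://vendor.com/api/v1", "usr", "pwd", "vendor_data.json"), where the output file name is a placeholder. Many vendors use API keys or tokens instead, so check the vendor's API documentation first.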

I’ve installed the tool, entered vendor information but get a “Cannot connect” or “Connection timed out” error - what do I do?
This is probably the biggest problem you will encounter! First double and triple check that you’ve entered the right information: check user name, password, server name, server port etc. Some vendors also restrict access by IP address, so go back to the vendor and confirm both that your details are correct and that your IP address is allowed.
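A quick way to separate a wrong password from a network problem is to test whether the server can be reached at all. The following is a minimal sketch using python's built-in socket module; the function name is hypothetical and the check only covers a direct TCP connection, not one made through a proxy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False
```

For example, can_connect("ftp.vendor.com", 21) checks the standard FTP port for the sample server above. If this returns False even though the server name and port are correct, the problem is likely on the network side rather than with your credentials.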

Assuming you have entered the correct information, the most likely cause of this error is that you have not configured a proxy server. In large corporations, you typically don’t have direct internet access and must connect via a proxy server, which means you need to give the tools your proxy server information. Some tools have explicit settings for entering proxy information (see e.g. the WinSCP proxy settings). For tools that do not, you should configure the environment variables HTTP_PROXY and HTTPS_PROXY with the addresses of your proxy servers, which you can obtain from your IT department. Consult your operating system’s documentation to learn how to set environment variables. You might need to use cntlm if your proxy requires authentication. You may need to restart your computer for the settings to become active.
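If you only need the proxy inside a python script, the variables can also be set from within the script itself. This is a minimal sketch: the function name and the proxy address used in the example call are placeholders, and the real address has to come from your IT department. Both upper- and lower-case variable names are set because tools differ in which one they read.

```python
import os
import urllib.request


def configure_proxy(proxy_url):
    """Set the proxy environment variables for the current process only."""
    for name in ("HTTP_PROXY", "HTTPS_PROXY"):
        os.environ[name] = proxy_url
        os.environ[name.lower()] = proxy_url
    # urllib reads these variables; return what it now sees as a check.
    return urllib.request.getproxies()
```

Calling configure_proxy("http://10.0.0.1:8080") affects only the running script and anything it launches, so no restart is needed, but other programs on your computer will not see the setting.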

Even if you have configured the proxy server correctly, you may still get a “Cannot connect to server” error. Large companies might block direct FTP or S3 access. You can ask your IT support whether you can access FTP and AWS S3 via your proxy server. If the answer is no, unless you can find a creative way around it, you will have to get your IT department involved for you to get files.

How can I make regular updates?
You can use the programmatic versions of the tools to regularly pull files in an automated fashion. For simple tasks you can create batch files (Windows) or shell scripts (Linux) and schedule them for automated execution in Task Scheduler (Windows) or cron (Linux). To manage more complex automated data pipelines you should look at Airflow or Luigi.
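A scheduled job should only fetch files it does not already have. Here is a minimal sketch of such a script for the FTP case, using python's built-in ftplib rather than Pyftpsync. The function names are hypothetical, and it assumes the vendor folder contains only plain files whose names never change once published.

```python
from ftplib import FTP
from pathlib import Path


def select_new_files(remote_names, local_dir):
    """Return the remote file names not yet present in local_dir."""
    local = Path(local_dir)
    existing = {p.name for p in local.iterdir()} if local.is_dir() else set()
    return sorted(name for name in remote_names if name not in existing)


def pull_new_files(host, user, password, remote_dir, local_dir):
    """Download files from remote_dir that are missing in local_dir."""
    local = Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    downloaded = []
    with FTP(host, timeout=30) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(remote_dir)
        for name in select_new_files(ftp.nlst(), local):
            # Write to a temporary name first so that an interrupted
            # download is not mistaken for a finished file on the next run.
            partial = local / (name + ".part")
            with open(partial, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
            partial.replace(local / name)
            downloaded.append(name)
    return downloaded
```

With the sample credentials above, the call would be pull_new_files("ftp.vendor.com", "usr", "pwd", "/data/", "data"). Saved as a script, it can then be run on a schedule from Task Scheduler or cron.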
PART 2: INGEST DATA (next week)
 
