The Best Tools to Analyze Alternative Data | Part 1: Get Data
Unlike buying datasets and hiring people, the technology infrastructure required to analyze alternative data can be acquired at minimal to zero cost. We have put together a list of all the free tools you need to analyze alternative data without having to spend a lot. These resources can be used individually to perform a particular function or together to form a comprehensive technology stack to operate your entire data effort.
The list walks you through the alternative data analysis process step by step. For each step, it gives you the key tools, with links to their documentation, that you need to complete that step. We will publish each step in a succession of articles. The final product will give you the resources needed to analyze alternative data at a cost of $0!
Step-by-Step Process:
- Get Data: copy data from the vendor to where you want to analyze data
- Ingest Data: you may need to manipulate raw data for it to be loaded correctly
- Load Data: import data into the analysis package
- Preprocess Data: clean, filter and transform data for it to be ready for modeling
- Modeling: apply your analysis and draw conclusions
- Presenting: present your insights and conclusions in a digestible format
TEACH-IN: Manipulating Alternative Data with Python
Augvest & AlternativeData.org
Taught by Norman Niemer, Chief Data Scientist at UBS Asset Management (QED). A hands-on technical session on using Python for ingesting, preprocessing and analyzing data. It will also address common problems encountered with alternative data, such as data schema changes and mismatched identifiers between vendors. Designed for both beginners and experienced coders.
- Thursday, July 19, 2018 | 6:00pm
- New York City
- RSVP HERE
Part 1: Get Data
Goal | Tool | Notes/Tips | Cost |
AWS S3 file download | S3 Browser (graphical), AWS CLI (programmatic) | | $0 |
Vendor API data download | Postman (graphical), Requests (programmatic) | | $0 |
FTP file download | WinSCP (graphical), pyftpsync (programmatic) | | $0 |
Corporate proxy authentication | Cntlm | | $0 |
Why should I care?
So you’ve found a dataset on AlternativeData.org, the vendor gave you access and you are ready to leverage alternative data! The very first step to analyzing alternative data is to get your hands on the data. Unless you’re working with small sample data that fits in an email, you will likely need to use one of these tools. With them, you will be able to obtain the raw data needed to run your analysis.
I’m a total beginner! How do I learn Python and how do I install it?
If you’ve never used Python before, we recommend you install Anaconda Python and learn to code using one of the many excellent free or affordable resources available.
Note: you can get by without Python for "getting data", but you will definitely need Python later on when you want to process and analyze data.
How do I know which tool to use?
Which tool you need depends on where the data is located. The vendor will give you access credentials that typically look like one of the following:
AWS S3 Files
access key: AKIAIOSFODNN7EXAMPLE
secret key: wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY
Bucket: s3://nameofbucket/folder/
Vendor API
Endpoint: http://vendor.com/api/v1
Username: usr
Password: pwd
FTP
IP/Endpoint: ftp.vendor.com
Folder: /data/
Username: usr
Password: pwd
Note: these cover only the most common scenarios. For other cases, consult your vendor.
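As a rough illustration of how the three credential types above translate into code, the sketch below uses only the Python standard library plus the AWS CLI from the tool list. It is a minimal sketch under assumptions, not any vendor's documented method: the function names are hypothetical, the API is assumed to accept HTTP Basic authentication (many vendors use tokens or API keys instead), and the FTP folder is assumed to contain only plain files, no subfolders.

```python
import base64
import urllib.request
from pathlib import Path


def s3_sync_command(bucket_uri, local_dir):
    """Build an `aws s3 sync` command line for the AWS CLI.

    The access key and secret key are expected to be stored beforehand
    with `aws configure`; they are not passed on the command line.
    """
    return ["aws", "s3", "sync", bucket_uri, str(local_dir)]


def api_request(endpoint, username, password):
    """Build a GET request that sends username/password via HTTP Basic auth.

    Assumption: the vendor API accepts Basic auth. Check the vendor docs.
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(endpoint, headers={"Authorization": f"Basic {token}"})


def ftp_download_folder(ftp, remote_dir, local_dir):
    """Download every entry of remote_dir with a logged-in ftplib.FTP object.

    Assumption: remote_dir holds only files (RETR fails on a directory).
    Returns the list of downloaded file names.
    """
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    ftp.cwd(remote_dir)
    downloaded = []
    for name in ftp.nlst():
        with open(local_dir / name, "wb") as fh:
            ftp.retrbinary(f"RETR {name}", fh.write)
        downloaded.append(name)
    return downloaded


# Example usage with the placeholder credentials from the text (not run here):
#
#   import subprocess
#   subprocess.run(s3_sync_command("s3://nameofbucket/folder/", "vendor_data"), check=True)
#
#   with urllib.request.urlopen(api_request("http://vendor.com/api/v1", "usr", "pwd")) as resp:
#       payload = resp.read()
#
#   from ftplib import FTP
#   with FTP("ftp.vendor.com") as ftp:
#       ftp.login("usr", "pwd")
#       ftp_download_folder(ftp, "/data/", "vendor_data")
```

The FTP helper takes an already connected object rather than a host name, so the login step stays visible and the download logic can be tried out without a live server.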
I’ve never used any of those tools, how do I get started?
Follow the instructions and documentation at the links in the list. Where available, there is a quickstart link in the notes. Also, attend the Teach-In on July 19th.
I’ve installed the tool, entered vendor information but get a “Cannot connect” or “Connection timed out” error - what do I do?
This is probably the biggest problem you will encounter! First, double and triple check that you’ve entered the right information: user name, password, server name, server port, etc. Some vendors also restrict access by IP address, so go back to the vendor and confirm that your details are correct and that your IP address is allowed.
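Before digging into the tool's settings, it can help to check whether the server is reachable at all from your machine. The sketch below is one possible way to do that with Python's standard library; the helper name `can_connect` is hypothetical, and the host names in the usage comments are the placeholders from the credential examples. Note that behind a corporate proxy this direct check can fail even though the tool would work once the proxy is configured.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a direct TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS lookup failures.
        return False


# Example usage (not run here); 21 and 443 are the standard FTP and HTTPS ports:
#
#   can_connect("ftp.vendor.com", 21)
#   can_connect("vendor.com", 443)
```

A result of False for a host that the vendor confirms is correct points towards a network restriction on your side rather than a wrong user name or password.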
Assuming you have entered the correct information, the most likely cause of this error is that you have not configured a proxy server. In large corporations, you typically don’t have direct internet access and instead connect via a proxy server, which means you need to give the tools your proxy server information. Some tools have explicit settings for entering proxy information (see e.g. the WinSCP proxy settings). For tools without an explicit setting, configure the environment variables HTTP_PROXY and HTTPS_PROXY with the addresses of your proxy servers, which you can obtain from your IT department. You might need to use Cntlm if your proxy requires authentication. You may need to restart your computer for the settings to become active.
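To make the role of HTTP_PROXY and HTTPS_PROXY more concrete, here is a minimal sketch of how a Python script could read those variables and route its requests through the proxy using only the standard library. The function names are hypothetical, and `proxy.example.com:8080` in the usage comment is a made-up address; use the one your IT department gives you.

```python
import os
import urllib.request


def proxies_from_env(env=os.environ):
    """Collect proxy URLs from HTTP_PROXY / HTTPS_PROXY (upper or lower case)."""
    proxies = {}
    for scheme in ("http", "https"):
        value = env.get(f"{scheme.upper()}_PROXY") or env.get(f"{scheme}_proxy")
        if value:
            proxies[scheme] = value
    return proxies


def opener_with_proxy(env=os.environ):
    """Build a urllib opener that sends requests through the configured proxies.

    If neither variable is set, the opener connects directly (an empty
    dict tells ProxyHandler to use no proxy at all).
    """
    handler = urllib.request.ProxyHandler(proxies_from_env(env))
    return urllib.request.build_opener(handler)


# Example usage (not run here), with a hypothetical proxy address:
#
#   os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
#   os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"
#   with opener_with_proxy().open("http://vendor.com/api/v1") as resp:
#       payload = resp.read()
```

Many command-line and Python tools pick these variables up on their own, so setting them once at the operating system level is usually enough; reading them explicitly, as here, mainly helps when you want to see what a script will actually use.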
Even if you have configured the proxy server correctly, you may still get a “Cannot connect to server” error. Large companies might block direct FTP or S3 access. Ask your IT support whether you can access FTP and AWS S3 via your proxy server. If the answer is no, unless you can find a creative way around it, you will have to get your IT department involved to get the files.
How can I make regular updates?
You can use the programmatic tools in the list to pull files regularly in an automated fashion. For simple tasks, you can create batch files (Windows) or shell scripts (Linux) and schedule them for automated execution with Task Scheduler (Windows) or cron (Linux). To manage more complex automated data pipelines, look at Airflow or Luigi.
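For a simple scheduled pull, the script usually only needs to fetch files it does not have yet. The sketch below shows one way to work that out; the helper name `files_to_fetch`, the folder name and the script path in the cron line are all hypothetical, and comparing by file name alone assumes the vendor never changes a file after publishing it.

```python
from pathlib import Path


def files_to_fetch(remote_names, local_dir):
    """Return the remote file names that are not yet present in local_dir."""
    local_dir = Path(local_dir)
    existing = {p.name for p in local_dir.iterdir()} if local_dir.is_dir() else set()
    return sorted(name for name in remote_names if name not in existing)


# Example usage (not run here): combine with whichever download step you use.
#
#   remote = ["2018-07-01.csv", "2018-07-02.csv"]   # e.g. a listing from the vendor
#   for name in files_to_fetch(remote, "vendor_data"):
#       ...  # download `name` into vendor_data/
#
# A cron entry that runs such a script every day at 06:00 could look like:
#
#   0 6 * * * /usr/bin/python3 /path/to/pull_data.py
```

On Windows, the same script can be registered as a daily task in Task Scheduler instead of cron.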
PART 2: INGEST DATA (next week)