Data science is for anyone who loves to unravel tangled things and discover hidden wonders in an apparent mess.
It’s like searching for needles in haystacks; only that data scientists don’t need to get their hands dirty at all. Using fancy tools with colorful charts, and looking at piles of numbers, they just dive into data haystacks and find valuable needles in the form of insights of high business value.
A typical data scientist toolbox should include at least one item of each of these categories: relational databases, NoSQL databases, big data frameworks, visualization tools, scraping tools, programming languages, IDEs, and deep learning tools.
A relational database is a collection of data structured in tables with attributes. The tables can be linked to each other, defining relations and restrictions, and creating what is called a data model. To work with relational databases, you commonly use a language called SQL (Structured Query Language).
The applications that manage the structure and data in relational databases are called RDBMS (Relational DataBase Management Systems). There are a lot of such applications, and the most relevant have recently started to put their focus on the field of data science, adding functionality to work with big data repositories and to apply techniques such as data analytics and machine learning.
This one is Microsoft’s RDBMS, which has been evolving for more than 20 years by consistently expanding its enterprise functionality. Since its 2016 version, SQL Server offers a portfolio of services that include support for embedded R code. SQL Server 2017 raises the bet by renaming its R Services to Machine Language Services and adding support for the Python language (more on these two languages below).
With these important additions, SQL Server aims at data scientists who may not have experience with Transact SQL, the native query language of Microsoft SQL Server.
SQL Server is far from being a free product. You can buy licenses to install it on a Windows Server (the price will vary according to the number of concurrent users) or use it as a fee-based service, through the Microsoft Azure cloud. Learning Microsoft SQL Server is easy.
On the open-source software side, MySQL has the popularity crown of RDBMSs. Although Oracle currently owns it, it is still free and open-sourced under the terms of a GNU General Public License. Most web-based applications use MySQL as the underlying data repository, thanks to its compliance with the SQL standard.
Also helping to its popularity are its easy install procedures, its big community of developers, tons of comprehensive documentation, and third-party tools, such as phpMyAdmin, that simplify everyday management activities. Although MySQL has no native functions to do data analysis, its openness allows its integration with almost any visualization, reporting, and business intelligence tool you may choose.
Another open-source RDBMS option is PostgreSQL. While not as popular as MySQL, PostgreSQL stands out for its flexibility and extensibility, and its support for complex queries, the ones that go beyond the basic statements such as SELECT, WHERE and GROUP BY.
These features are letting it gain popularity among data scientists. Another interesting feature is the support for multi-environments, which allows it to be used in cloud and on-premise environments, or in a mix of both, commonly known as hybrid cloud environments.
PostgreSQL is capable of combining on-line analytical processing (OLAP) with online transaction processing (OLTP), working in a mode called hybrid transactional/analytical processing (HTAP). It is also well suited to work with big data, thanks to the addition of PostGIS for geographical data and JSON-B for documents. PostgreSQL also supports unstructured data, which allows it to be in both categories: SQL and NoSQL databases.
Also known as non-relational databases, this type of data repository provides faster access to non-tabular data structures. Some examples of these structures are graphs, documents, wide columns, key values, among many others. NoSQL data stores can put aside data consistency in favor of other benefits, such as availability, partitioning, and access speed.
Since there is no SQL in NoSQL data stores, the only way to query this kind of database is by using low-level languages, and there is no such language that is as widely accepted as SQL. Besides, there are no standard specifications for NoSQL. That is why, ironically, some NoSQL databases are beginning to add support for SQL scripts.
MongoDB is a popular NoSQL database system, which stores data in the form of JSON documents. Its focus is on the scalability and the flexibility to store data in a non-structured way. This means that there is no fixed field list that must be observed in all the stored elements. Furthermore, the data structure can be changed over time, something that in a relational database implies a high risk of affecting running applications.
The technology in MongoDB allows for indexing, ad-hoc queries, and aggregation that provide a strong foundation for data analysis. The distributed nature of the database provides high availability, scaling, and geographic distribution without the need for sophisticated tools.
This one is another option in the open-source, NoSQL front. It is basically a data structure store that operates in memory and, besides providing database services, it also works as cache memory and message broker.
It supports a myriad of unconventional data structures, including hashes, geospatial indexes, lists, and sorted sets. It is well suited for data science thanks to its high performance in data-intensive tasks, such as computing set intersections, sorting long lists, or generating complex rankings. The reason for Redis’ outstanding performance is its in-memory operation. It can be configured to persists the data selectively.
Big Data frameworks
Suppose you have to analyze the data Facebook users generate during a month. We’re talking about photos, videos, messages, all of it. Taking into account that more than 500 terabytes of data are added every day to the social network by its users, it is hard to measure the volume represented by a whole month of its data.
To manipulate that huge amount of data in an effective way, you need an appropriate framework capable of computing statistics over a distributed architecture. There are two of the frameworks that lead the market: Hadoop and Spark.
As a big data framework, Hadoop deals with the complexities associated with the retrieval, processing, and storing of huge piles of data. Hadoop operates in a distributed environment, composed of computer clusters that process simple algorithms. There is an orchestrating algorithm, called MapReduce, that divides big tasks into small parts, and then distributes those small tasks between available clusters.
Hadoop is recommended for enterprise-class data repositories that require fast access and high availability, all that in a low-cost scheme. But you need a Linux admin with deep Hadoop knowledge to maintain the framework up and to run.
Hadoop is not the only framework available for big data manipulation. Another big name in this area is Spark. The Spark engine was designed to surpass Hadoop in terms of analytics speed and ease of use. Apparently, it achieved this goal: some comparisons say that Spark runs up to 10 times faster than Hadoop when working on a disk, and 100 times faster operating in-memory. It also requires a smaller number of machines to process the same amount of data.
Besides speed, another benefit of Spark is its support for stream processing. This type of data processing, also called real-time processing, involves continuous input and output of data.
A common joke between data scientists says that, if you torture the data long enough, it will confess what you need to know. In this case, “torture” means to manipulate the data by transforming and filtering it, in order to better visualize it. And that’s where data visualization tools come to the scene. These tools take pre-processed data from multiple sources and show its revealed truths in graphical, understandable forms.
There are hundreds of tools that fall into this category. Like it or not, the most widely used is Microsoft Excel and its charting tools. Excel charts are accessible to anyone who uses Excel, but they have limited functionality. The same applies to other spreadsheet applications, such as Google Sheets and Libre Office. But we’re talking here about more specific tools, specially tailored for business intelligence (BI) and data analysis.
Not long ago, Microsoft released its Power BI visualization application. It can take data from diverse sources, such as text files, databases, spreadsheets, and many online data services, including Facebook and Twitter, and use it to generate dashboards packed with charts, tables, maps, and many other visualization objects. The dashboard objects are interactive, meaning that you can click on a data series in a chart to select it and use it as a filter for the other objects on the board.
Power BI is a combination of a Windows desktop application (part of the Office 365 suite), a web application, and an online service to publish the dashboards on the web and share them with your users. The service allows you to create and manage permissions to grant access to the boards only to certain people.
Tableau is another option to create interactive dashboards from a combination of multiple data sources. It also offers a desktop version, a web version, and an online service to share the dashboards you create. It works naturally “with the way you think” (as it claims), and it is easy to use for non-technical people, which is enhanced through lots of tutorials and online videos.
Some of Tableau’s most outstanding features are its unlimited data connectors, its live and in-memory data, and its mobile-optimized designs.
QlikView offers a clean and straightforward user interface to help analysts discover new insights from existing data through visual elements that are easily comprehensible for everyone.
This tool is known for being one of the most flexible business intelligence platforms. It provides a feature called Associative Search, which helps you focus on the most important data, saving you the time it would take to find it on your own.
With QlikView, you can collaborate with partners in real-time, doing comparative analysis. All the pertinent data can be combined into one app, with security features that restrict access to the data.
In the times when the internet was just emerging, the web crawlers started traveling all along with the networks gathering information in their way. As technology evolved, the term web crawling changed for web scraping, but still meaning the same: to automatically extract information from websites. To do web scraping, you use automated processes, or bots, that jump from one webpage to another, extracting data from them and exporting it to different formats or inserting it in databases for further analysis.
Below we summarize the characteristics of three of the most popular web scrapers available today.
Octoparse web scraper offers some interesting characteristics, including built-in tools to obtain information from websites that don’t make it easy for scraping bots to do their jobs. It is a desktop application that requires no coding, with a user-friendly UI that allows visualizing the extraction process through a graphical workflow designer.
Together with the standalone application, Octoparse offers a cloud-based service to speed up the data extraction process. Users can experience a 4x to 10x speed gain when using the cloud service instead of the desktop application. If you stick to the desktop version, you can use Octoparse for free. But if you prefer to use the cloud service, you will have to pick one of its paid plans.
If you’re looking for a feature-rich scraping tool, you should put an eye on Content Grabber. Unlike Octoparse, to use Content Grabber, it’s necessary to have advanced programming skills. In exchange, you get scripting editing, debugging interfaces, and other advanced functionalities. With Content Grabber, you can use .Net languages to write regular expressions. This way, you don’t have to generate the expressions using a built-in tool.
The tool offers an API (Application Programming Interface) that you can use to add scraping capabilities to your desktop and web applications. To use this API, developers need to obtain access to the Content Grabber Windows service.
Like Content Grabber, it is recommended to have some programming knowledge to make the most out of ParseHub. It has a free version, limited to 5 projects, and 200 pages per run.
Just like the previously mentioned SQL language is designed specifically to work with relational databases, there are other languages created with a clear focus on data science. These languages allow the developers to write programs that deal with massive data analysis, such as statistics and machine learning.
SQL is also considered an important skill that developers should have to do data science, but that’s because most organizations still have a lot of data on relational databases. “True” data science languages are R and Python.
Python is a high-level, interpreted, general-purpose programming language, well suited for rapid application development. It has a simple and easy to learn syntax that allows for a steep learning curve and for reductions in the costs of program maintenance. There are many reasons why it is the preferred language for data science. To mention a few: scripting potential, verbosity, portability, and performance.
This language is a good starting point for data scientists who plan to experiment a lot before jumping into the real and hard data crunching work, and who want to develop complete applications.
The R language is mainly used for statistical data processing and graphing. Although it is not meant to develop full-fledged applications, as would be the case for Python, R has become very popular in recent years due to its potential for data mining and data analytics.
Thanks to an ever-growing library of freely available packages that extend its functionality, R is capable of doing all kinds of data-crunching work, including linear/nonlinear modeling, classification, statistical tests, etc.
It is not an easy language to learn, but once you get acquainted with its philosophy, you will be doing statistical computing like a pro.
If you are seriously considering dedicating yourself to data science, then you will need to choose carefully an integrated development environment (IDE) that suits your needs, because you and your IDE will spend a lot of time working together.
An ideal IDE should put together all the tools you need in your every day work as a coder: a text editor with syntax highlighting and auto-completion, a powerful debugger, an object browser, and easy access to external tools. Besides, it must be compatible with the language of your preference, so it is a good idea to choose your IDE after knowing which language you will use.
This generic IDE is mostly intended for scientists and analysts that also need to code. To make them comfortable, it doesn’t limit itself to the IDE functionality –it also provides tools for data exploration/visualization and interactive execution, as could be found on a scientific package. The editor in Spyder supports multiple languages and adds a class browser, window splitting, jump-to-definition, auto code completion, and even a code analysis tool.
The debugger helps you trace each line of code interactively, and a profiler helps you find and eliminate inefficiencies.
If you program in Python, chances are that your IDE of choice will be PyCharm. It has a smart code editor with smart search, code completion, and error detection and fixing. With just one click, you can jump from the code editor to any context-related window, including test, super method, implementation, declaration, and more. PyCharm supports Anaconda and many scientific packages, such as NumPy and Matplotlib, to name just two of them.
It offers integration with the most important version control systems, and also with a test runner, a profiler and a debugger. To close the deal, it also integrates with Docker and Vagrant to provide for cross-platform development and containerization.
For those data scientists that prefer the R team, the IDE of choice should be RStudio, because of its lots of features. You can install it on a desktop with Windows, macOS, or Linux, or you could run it from a web browser if you don’t want to install it locally. Both versions offer goodies such as syntax highlighting, smart indentation, and code completion. There’s an integrated data viewer that comes in handy when you need to browse tabular data.
The debugging mode allows viewing how the data is being updated dynamically when executing a program or script step-by-step. For version control, RStudio integrates support for SVN and Git. A nice plus is a possibility to author interactive graphics, with Shiny and gives libraries.
Your personal toolbox
At this point, you should have a complete view of the tools you should know to excel in data science. Also, we hope we gave you enough information to decide which is the most convenient option within each tool category. Now it’s up to you. Data science is a flourishing field were to develop a career. But if you want to do so, you must keep up with the changes in trends and technologies, since they occur almost on a daily basis.