New Courses: Learn Command Line Fundamentals for Data Science

Reasons to Learn the Command Line

Making the switch from graphical user interfaces (GUIs) to a CLI can feel overwhelming, but we’re here to help you! To give you a jump-start, here are a few reasons why you should be learning the command line.

1. Command Line Skills Are Popular and Pay Handsomely

According to Stack Overflow’s 2018 Developer Survey, Bash/Shell (i.e. the family of command language interpreters used on Linux and other Unix-like systems) is the sixth most used language overall, ranking ahead of Python and R. The survey also associated it with higher salaries than either Python or R.

It also made the list of the most wanted and most loved technologies, while staying clear of the most dreaded technologies list.

And while Stack Overflow’s survey covers software developers and engineers of all sorts, the command line is of particular relevance for data scientists because Bash/Shell correlates heavily with data science technologies like Python, IPython/Jupyter, TensorFlow and PyTorch. This is also supported by the most recent Python Developers Survey conducted by the Python Software Foundation.

2. Command Line Skills Help With Building Repeatable Data Processes

Part of a data scientist’s role is to make sure certain information is available regularly, often daily. Most of the time this data is acquired, processed and displayed in the same way.

The command line is well suited for this purpose because commands are easily automated and replicated.

Consider the following situation. Your employer has decided to invest in data analytics, and several data professionals will be joining the team. You are tasked with making sure that their machines have everything they need to get started. If you can work with a CLI (command line interface), you can write a few scripts that will install, configure and test everything automatically. If you can’t, you’ll have to resort to a GUI and repeat the same mouse movements and clicks across several machines.

That’s just one example of how terminal skills can help make data science processes more scalable and repeatable.
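The "test" step of that setup scenario can be sketched as a short script. This is a minimal sketch, not a full provisioning tool: the tool list in REQUIRED_TOOLS and the function name check_tools are assumptions chosen for illustration, and a real setup script would also install and configure whatever is missing.

```shell
#!/bin/sh
# Hypothetical list of tools a new analyst's machine should have.
REQUIRED_TOOLS="python3 git awk sed"

# Print one line per missing tool; return non-zero if any tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the command.
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Word splitting of $REQUIRED_TOOLS is intentional here.
if check_tools $REQUIRED_TOOLS; then
    echo "All required tools are installed."
else
    echo "Install the tools listed above, then re-run this check."
fi
```

Because the check is a script, the same file can be copied to every new machine and run unchanged, which is exactly what is hard to do with a sequence of clicks.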

3. Command Line Skills Make You More Flexible

In a data science role, you’ll often find you have more flexibility if you can use the terminal rather than having to rely on clicking through GUIs. Since the command line is a program that runs other programs (hence the name “shell”), the interaction between programs is often easier to adjust in the command line. Once you’ve mastered command line commands, it’s relatively easy to write scripts, and shell scripts make building all sorts of data pipelines and workflows much simpler.
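As a small illustration of programs feeding into one another, the sketch below chains four standard tools with pipes to count how often each value appears in a column. The file name signups.csv, its contents and the function name top_languages are made up for the example.

```shell
# Hypothetical sample data: one signup per line, as "user,language".
printf '%s\n' 'ana,python' 'bo,r' 'cy,python' 'di,sql' 'ed,python' 'fay,r' > signups.csv

# cut extracts the 2nd comma-separated field, sort groups identical values,
# uniq -c counts each group, and sort -rn puts the largest counts first.
top_languages() {
    cut -d',' -f2 "$1" | sort | uniq -c | sort -rn
}

top_languages signups.csv
```

With this sample data, python is listed first with a count of 3. Swapping any single stage (for example, replacing the last sort with head -n 1 to keep only the top value) changes the result without touching the rest of the pipeline.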

More broadly, knowing how to use the shell gives you a second option for interacting with your computer. You can always use the GUI when you prefer, but the command line can provide you with more direct power and control for those times when you need it.

4. Working With Text Files Is Easier

Text files are among the most common ways to store and handle data, and almost any data science project is going to involve some work with text files. Being able to handle text files quickly and efficiently is thus a very useful skill for a data scientist.

The shell has very powerful text processing tools like AWK and sed, which help with getting acquainted with files and facilitate data cleaning.

For example, the code below uses AWK to print the first and third columns of a file named a_csv_file, where the second field’s value is Dataquest, using a comma as a field separator.
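A command matching that description might look like the following sketch. It assumes a_csv_file has no header row and no quoted fields containing commas; the sample rows are made up so the command can be run as-is, and only the awk line does the actual work.

```shell
# Hypothetical sample data so the command below has something to read.
printf '%s\n' '1,Dataquest,python' '2,Other,r' '3,Dataquest,sql' > a_csv_file

# -F',' sets the field separator to a comma; the pattern keeps only rows
# whose 2nd field equals "Dataquest"; the action prints fields 1 and 3.
awk -F',' '$2 == "Dataquest" { print $1, $3 }' a_csv_file
# prints:
# 1 python
# 3 sql
```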

All it takes is one line of code!

5. It’s Less Resource-Intensive

When you’re working with limited computing resources or simply want to maximize your speed, using the command line is virtually always going to be better than using a GUI, because a GUI must dedicate resources to rendering graphical output.

This is true both for working locally and remotely. When connecting remotely, GUIs consume much more bandwidth than terminals, wasting resources. Moreover, latency, i.e. the “time interval between the stimulation and response”, will be higher when using a GUI, which can be particularly frustrating if you’re trying to control a mouse that’s a second or two behind your actual movements. If you’re just typing in the command line, the latency is likely to be lower and it will also be easier to handle since you know precisely where your cursor is at any given time.

6. You Need Command Line Skills for the Cloud

Cloud services are often accessed and operated through a command line interface. This is particularly important for more advanced data science work like deep learning, where your local computing resources are likely to be insufficient for the tasks you’d like to perform. To quote from a 2018 article by Nucleus Research:

In last year’s research, fewer than 10 percent of [deep learning] projects were being run on premise. That trend has accelerated, with only 4 percent of projects running on-premise in 2018.

According to the same article, “96 percent of deep learning today is running in the cloud.” If you’re interested in learning advanced techniques like deep learning, command line skills will be necessary for moving your data to and from the cloud efficiently.

7. Unix Shell Skills Transfer Well to Other Shells

There are just a few popular shells (bash, zsh, fish, ksh, tcsh, cmd, Windows PowerShell, etc.) and they are more alike than they are different, making it easy to switch between them. This is particularly useful when you’re using online services that require some kind of CLI. On the other hand, GUIs are endless, and learning one won’t necessarily help you learn any others.
