Lillian Pierson

Data Science For Dummies



results.

       Craft site-recommendation engines for use in land acquisitions and real estate development.

       Implement and interpret predictive analytics and forecasting techniques for net increases in business value.

      Data scientists must have extensive and diverse quantitative expertise to be able to solve these types of problems.

Machine learning is the practice of applying algorithms to learn from — and make automated predictions from — data.
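As a quick illustration of that definition, here is a minimal sketch (assuming scikit-learn is installed, with a made-up toy dataset) in which an algorithm learns from labeled examples and then makes automated predictions on values it hasn't seen:

# A minimal machine learning sketch, assuming scikit-learn is installed.
# The tiny dataset is invented for illustration: hours studied (feature)
# and whether the student passed (label).
from sklearn.linear_model import LogisticRegression

X_train = [[1], [2], [3], [8], [9], [10]]   # hours studied
y_train = [0, 0, 0, 1, 1, 1]                # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(X_train, y_train)        # the algorithm learns from the data

print(model.predict([[4], [7]]))   # automated predictions for unseen values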

      Defining machine learning engineering

      A machine learning engineer is essentially a software engineer who is skilled enough in data science to deploy advanced data science models within the applications they build, thus bringing machine learning models into production in a live environment like a Software as a Service (SaaS) product or even just a web page. Contrary to what you may have guessed, the role of machine learning engineer is a hybrid between a data scientist and a software engineer, not a data engineer. A machine learning engineer is, at their core, a well-rounded software engineer who also has a solid foundation in machine learning and artificial intelligence. This person doesn’t need to know as much data science as a data scientist but should know much more about computer science and software development than a typical data scientist.

      

Software as a Service (SaaS) is a term that describes cloud-hosted software services that are made available to users via the Internet. Examples of popular SaaS companies include Salesforce, Slack, HubSpot, and so many more.
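To make the idea of bringing a model into production a bit more concrete, here is a minimal sketch of the kind of web endpoint a machine learning engineer might build around a trained model. It assumes Flask and joblib are installed, and the file name model.joblib is a hypothetical stand-in for a previously trained scikit-learn model:

# A minimal sketch of serving a trained model behind a web endpoint.
# Assumes Flask and joblib are installed; "model.joblib" is a hypothetical
# file holding a previously trained scikit-learn model.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")   # load the trained model once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as {"features": [[1.2, 3.4]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run()

A real deployment would add input validation, authentication, logging, and monitoring, but the basic shape is the same: a live service that accepts requests and returns the model's predictions.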

      Defining data engineering

      If engineering is the practice of using science and technology to design and build systems that solve problems, you can think of data engineering as the engineering domain that's dedicated to building and maintaining data systems for overcoming data processing bottlenecks and data-handling problems that arise from the high volume, velocity, and variety of big data.

      Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big datasets. Data engineers often have experience working with (and designing) real-time processing frameworks and massively parallel processing (MPP) platforms (discussed later in this chapter), as well as with RDBMSs. They generally code in Java, C++, Scala, or Python. They know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big data into datasets with more manageable sizes. Simply put, with respect to data science, the purpose of data engineering is to engineer large-scale data solutions by building coherent, modular, and scalable data processing platforms from which data scientists can subsequently derive insights.
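As a rough sketch of that last point, here is what refining a large raw dataset into a smaller summary might look like in PySpark. The file name events.csv and its column names are assumptions made purely for illustration:

# A rough sketch of refining raw big data into a smaller, more manageable
# dataset with Spark. Assumes PySpark is installed; the file name and
# column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("refine-events").getOrCreate()

# Read the raw event data (hypothetical file and columns)
raw = spark.read.csv("events.csv", header=True, inferSchema=True)

# Collapse billions of raw rows into one summary row per customer per day
summary = (raw.groupBy("customer_id", "event_date")
              .agg(F.count("*").alias("event_count"),
                   F.sum("amount").alias("total_amount")))

summary.write.mode("overwrite").parquet("daily_customer_summary")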

Most engineered systems are built systems — they are constructed or manufactured in the physical world. Data engineering is different, though. It involves designing, building, and implementing software solutions to problems in the data world — a world that can seem abstract when compared to the physical reality of the Golden Gate Bridge or the Aswan Dam.

      Using data engineering skills, you can, for example:

       Integrate data pipelines with the natural language processing (NLP) services that were built by data scientists at your company (see the sketch after this list).

       Build mission-critical data platforms capable of processing more than 10 billion transactions per day.

       Tear down data silos by finally migrating your company’s data from a more traditional on-premise data storage environment to a cutting-edge cloud warehouse.

       Enhance and maintain existing data infrastructure and data pipelines.

      Data engineers need solid skills in computer science, database design, and software engineering to be able to perform this type of work.
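To give a feel for the first item in the preceding list, here is a minimal sketch of a pipeline step that hands records to an NLP service for scoring. The endpoint URL, payload shape, and field names are hypothetical; a production pipeline would also add batching, retries, and monitoring:

# A minimal sketch of a pipeline step that calls an NLP service built by the
# data science team. The URL, payload, and field names are hypothetical.
import requests

NLP_ENDPOINT = "https://nlp.example.internal/score"   # hypothetical internal service

def enrich_with_sentiment(records):
    """Attach a sentiment score to each record by calling the NLP service."""
    enriched = []
    for record in records:
        response = requests.post(NLP_ENDPOINT, json={"text": record["text"]})
        response.raise_for_status()
        record["sentiment"] = response.json()["sentiment"]
        enriched.append(record)
    return enriched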

      Comparing machine learning engineers, data scientists, and data engineers

      The roles of data scientist, machine learning engineer, and data engineer are frequently conflated by hiring managers. If you look at the position descriptions that hiring companies post, you'll often see mismatched titles and roles, or a blanket expectation that applicants be a Swiss Army knife of data skills, able to do it all.

      

If you’re hiring someone to help make sense of your data, be sure to define the requirements clearly before writing the position description. Because data scientists must also have subject matter expertise in the particular areas in which they work, this requirement generally precludes data scientists from also having much expertise in data engineering. And, if you hire a data engineer who has data science skills, that person generally won’t have much subject matter expertise outside of the data domain. Be prepared to call in a subject matter expert (SME) to help out.

      Lastly, keep in mind that data engineer, machine learning engineer, and data scientist are just three small roles within a larger organizational structure. Managers, middle-level employees, and business leaders also play a huge part in the success of any data-driven initiative.

      A lot has changed in the world of big data storage options since the Hadoop debacle I mention earlier in this chapter. Back then, almost all business leaders clamored for on-premise data storage. Delayed by years due to the admonitions of traditional IT leaders, corporate management is finally beginning to embrace the notion that storing and processing big data with a reputable cloud service provider is the most cost-effective and secure way to generate value from enterprise data. In the following sections, you see the basics of what’s involved in both cloud and on-premise big data storage and processing.

      Storing data and doing data science directly in the cloud

      After you have realized the upside potential of storing data in the cloud, it’s hard to look back. Storing data in a cloud environment offers serious business advantages, such as these:

       Faster time-to-market: Many big data cloud service providers take care of the bulk of the work required to configure, maintain, and provision the computing resources needed to run jobs within a defined system (also known as a compute environment). This dramatically increases ease of use and ultimately allows for faster time-to-market for data products.

       Enhanced flexibility: Cloud services are extremely flexible with respect to usage requirements. If you set up in a cloud environment and then your project plan changes, you can simply turn off the cloud service with no further charges incurred. This isn’t the case with on-premise storage, because once you purchase the server, you own it. Your only option from then on is to extract the best possible value from a noncancelable resource.

       Security: If you go with reputable cloud service providers — like Amazon Web Services, Google Cloud, or Microsoft Azure — your data is likely to be a whole lot more secure in the cloud than it would be on-premise. That’s because of the sheer number of resources that these megalith players dedicate to protecting and preserving the security of the data they store. I can’t think of a multinational company that would