Tim Rey

Applied Data Mining for Forecasting Using SAS



       3.4.1 Internal Data Infrastructure

       3.4.2 External Data Infrastructure

       3.5 Organizational Infrastructure

       3.5.1 Developers Infrastructure

       3.5.2 Users Infrastructure

       3.5.3 Work Process Implementation

       3.5.4 Integration with IT

      Applying data mining for forecasting in a business requires serious investments in hardware, software, and training, and a cultural change must also take place. It is very important to estimate the size of the investment based on technical requirements and the products that are available on the market. The four main components of any forecasting infrastructure are hardware, software, data, and organizational infrastructure. The first three components form the technical basis that supports applied data mining for forecasting, and the fourth component is critical to effectively changing the culture of the organization. This chapter focuses on an enterprise-wide implementation strategy of data mining for forecasting. The importance of integrating the selected options into the existing corporate infrastructure is discussed at the end of the chapter.

      The objective of this section is to give the reader a condensed overview of the potential hardware architectures for implementing data mining for forecasting systems in an industrial setting. Three options are discussed briefly below: (1) PC network, (2) client/server, and (3) cloud computing infrastructures. However, due to rapid technology changes, today's recommendations can easily become obsolete tomorrow.

      The least expensive hardware solution for implementing data mining for forecasting systems in an industrial setting is to avoid any additional hardware expenses and use the existing information system infrastructure. Usually, this is based on a PC network. The key advantages of this option are as follows:

       low cost

       easy integration in the existing information system infrastructure

       minimal installation and maintenance efforts

       robust performance due to the decentralized architecture

      The main limitations of the PC network infrastructure solution for implementing data mining for forecasting systems are as follows:

       limitations for large data set processing

       slower processing speed relative to servers

       limited operating system options

      The client/server model assumes a division of the computing resources between clients or workstations with local processing capabilities and servers with large memory and disk space and more powerful processors. The clients request services such as data, and the servers retrieve resources and deliver the requested information. The number of servers required depends on the number of clients, network speed and capacity, global and local operation, reliability, and so on.
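      The division of labor described above can be illustrated with a minimal, in-process sketch. It is not a networked implementation and not how any particular product works: the class names `ForecastServer` and `Client`, the request fields `service` and `table`, and the `mean_forecast` service are all hypothetical names chosen for illustration. The server object holds the data and performs the computation, while the clients only send requests and receive responses.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ForecastServer:
    """Hypothetical server: owns the data and performs the computation."""
    tables: dict = field(default_factory=dict)

    def handle(self, request: dict) -> dict:
        """Dispatch one client request and return a response message."""
        series = self.tables.get(request.get("table"))
        if series is None:
            return {"status": "error", "detail": "unknown table"}
        if request.get("service") == "data":
            # Data service: retrieve the resource and deliver it as-is.
            return {"status": "ok", "result": list(series)}
        if request.get("service") == "mean_forecast":
            # Stand-in for an intensive modeling task run on the server.
            return {"status": "ok", "result": mean(series)}
        return {"status": "error", "detail": "unknown service"}


@dataclass
class Client:
    """Hypothetical workstation: builds requests, keeps only the response."""
    server: ForecastServer

    def request(self, service: str, table: str) -> dict:
        return self.server.handle({"service": service, "table": table})


# One server shared by two clients (illustrative data only).
server = ForecastServer(tables={"sales": [10.0, 12.0, 14.0]})
workstation_a = Client(server)
workstation_b = Client(server)

data_response = workstation_a.request("data", "sales")
forecast_response = workstation_b.request("mean_forecast", "sales")
missing_response = workstation_b.request("data", "inventory")
```

      In a real deployment the call to `handle` would travel over the network, which is where the factors listed above (number of clients, network speed and capacity, reliability) start to determine how many servers are needed.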

      An example of a minimal client/server infrastructure based on SAS is shown in Figure 3.1. The example includes four types of servers and two types of clients: modeler PCs and final user PCs. One server is allocated to handle metadata. A data mart server, based on Oracle, interacts with the large database cluster containing the corporate data. The third server runs SAS and is devoted to intensive computing tasks. Several clients can share the server resources either for developing new models or for running developed models as stored processes.

      The key advantages of the client/server infrastructure for implementing data mining for forecasting are given below:

       very powerful processing capabilities

       large memory and high-throughput disk

       the use of different operating systems

       capacity to process large data sets

[Figure 3.1: minimal client/server infrastructure based on SAS]

      The disadvantages of this option are as follows:

       high cost

       more complex maintenance and support

       lower reliability if servers are down

      The advantages, however, outweigh the disadvantages and the client/server infrastructure is the standard solution for large-scale industrial applications of data mining and forecasting.

      Another potential solution, called cloud computing, uses powerful external and internal computing resources, and includes grid computing for parallel processing, multi-tiered computer architecture, and the capacity to handle super-large data sets. Such services are currently offered by a number of vendors including well-established industry leaders. Some of the advantages of using this option are as follows:

       low implementation and maintenance cost

       super-computer power, which is continuously upgraded by the cloud owner

       data consolidation in very large data sets

       increased reliability

      The disadvantages of using a cloud computing infrastructure are summarized as follows:

       security of proprietary data

       initial transfer of very large corporate data to the cloud

       limited software options

       trust issues

       information technology (IT) management resistance

      This option is still in an exploratory phase and has generated a lot of hype. However, if its technical and economic advantages are proven in more industrial applications, it could become a popular hardware infrastructure in the near future.

      The lion's share of the costs for implementing data mining for forecasting systems, especially for the PC network infrastructure, is not the cost of hardware but the cost of software infrastructure. One of the key decisions to make in advance is the scale of the efforts. In the case of large-scale forecasting on a corporate level that is to be implemented across the globe, an integrated software environment made up of all necessary components with global support is strongly recommended. An example of such infrastructure (based on SAS software) is discussed in this book.

      This part of the infrastructure strongly depends on the existing corporate information system architecture. Unfortunately, that architecture can be very diverse, with different database platforms. In most cases, however, the data are organized in relational databases and stored in separate tables for each entity. The relationship between the tables is defined by two columns: the primary key and foreign key columns (Svolba 2006). Data that are accessed from a relational database are usually extracted table by table and are merged according to the primary