DataFlux Data Management Studio: Basics Course Notes

DataFlux Data Management Studio: Basics Course Notes was developed by Kari Richardson. Editing and production support was provided by the Curriculum Development and Support Department.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

DataFlux Data Management Studio: Basics Course Notes

Copyright © 2012 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code E2229, course code DQ22DMP1, prepared date 09May2012.
ISBN 978-1-61290-309-5

Table of Contents

Course Description ... vi
Prerequisites ... vii

Chapter 1  Introduction to DataFlux Methodology and Course Flow ... 1-1
  1.1 Introduction ... 1-3
  1.2 Course Flow ... 1-6

Chapter 2  DataFlux Data Management Studio: Getting Started ... 2-1
  2.1 Introduction ... 2-3
      Demonstration: Navigating the DataFlux Data Management Studio Interface ... 2-6
      Demonstration: Creating a DataFlux Data Management Studio Repository ... 2-20
      Exercises ... 2-26
  2.2 Quality Knowledge Base and Reference Sources ... 2-27
      Demonstration: Verifying the Course QKB and Reference Sources ... 2-31
  2.3 Data Connections ... 2-33
      Demonstration: Working with Data Connections ... 2-36
      Exercises ... 2-53
  2.4 Solutions to Exercises ... 2-54

Chapter 3  PLAN ... 3-1
  3.1 Creating Data Collections ... 3-3
      Demonstration: Creating Collections of Address and Note Fields ... 3-5
      Exercises ... 3-13
  3.2 Designing Data Explorations ... 3-14
      Demonstration: Creating and Exploring a Data Exploration ... 3-18
  3.3 Creating Data Profiles ... 3-32
      Demonstration: Creating and Exploring a Data Profile ... 3-39
      Exercises ... 3-53
      Demonstration: Profiling Data Using Text File Input ... 3-56
      Demonstration: Profiling Data Using Filtered Data and an SQL Query ... 3-61
      Demonstration: Profiling Data Using a Collection ... 3-68
      Exercises ... 3-70
      Demonstration: Data Profiling – Additional Analysis (Self-Study) ... 3-73
  3.4 Designing Data Standardization Schemes ... 3-83
      Demonstration: Creating a Phrase Standardization Scheme ... 3-85
      Demonstration: Creating an Element Standardization Scheme ... 3-92
      Demonstration: Importing a Scheme from a Text File ... 3-96
      Exercises ... 3-98
  3.5 Solutions to Exercises ... 3-99

Chapter 4  ACT ... 4-1
  4.1 Introduction to Data Jobs ... 4-3
      Demonstration: Setting DataFlux Data Management Studio Options ... 4-5
      Demonstration: Creating and Running a Simple Data Job ... 4-10
      Exercises ... 4-33
  4.2 Data Quality Jobs ... 4-34
      Demonstration: Investigating Standardizing, Parsing, and Casing ... 4-38
      Demonstration: Investigating Right Fielding and Identification Analysis ... 4-50
      Exercises ... 4-61
      Demonstration: Using a Standardization Definition and a Standardization Scheme ... 4-63
  4.3 Data Enrichment Jobs (Self-Study) ... 4-75
      Demonstration: Performing Address Verification and Geocoding ... 4-77
      Exercises ... 4-88
  4.4 Entity Resolution Jobs ... 4-90
      Demonstration: Creating a Data Job to Cluster Records ... 4-93
      Exercises ... 4-108
      Demonstration: Creating an Entity Resolution Job ... 4-112
      Demonstration: Creating a Data Job to Compare Clusters (Optional) ... 4-138
  4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) ... 4-147
      Demonstration: Multi-Input/Multi-Output Data Job: New Products ... 4-149
      Demonstration: Multi-Input/Multi-Output Data Job: Customer Matches ... 4-159
      Exercises ... 4-166

Chapter 5  MONITOR ... 5-1
  5.1 Business Rules ... 5-3
      Demonstration: Creating a Row-Based Business Rule ... 5-6
      Exercises ... 5-11
      Demonstration: Adding a Business Rule and an Alert to a Profile ... 5-15
      Demonstration: Performing a Historical Visualization ... 5-21
      Exercises ... 5-25
      Demonstration: Creating a Monitoring Job for a Row-Based Rule ... 5-29
      Demonstration: Creating Another Data Job Calling the Same Task ... 5-38
      Exercises ... 5-41
      Demonstration: Creating a Monitoring Job for a Set-Based Rule ... 5-42
      Exercises ... 5-48
      Demonstration: Viewing the Monitor Dashboard ... 5-50

Course Description

This course is for data quality stewards who perform data management tasks such as data quality, data enrichment, and entity resolution.

To learn more… For information about other courses in the curriculum, contact the SAS Education Division at 1-800-333-7660, or send e-mail to [email protected]. You can also find this information on the Web at support.sas.com/training/ as well as in the Training Course Catalog.

For a list of other SAS books that relate to the topics covered in these course notes, USA customers can contact the SAS Publishing Department at 1-800-727-3228 or send e-mail to [email protected]. Customers outside the USA should contact their local SAS office. Also, see the Publications Catalog on the Web at support.sas.com/pubs for a complete list of books and a convenient order form.

Prerequisites

There currently are no prerequisites for this course.

Chapter 1  Introduction to DataFlux Methodology and Course Flow

1.1 Introduction ... 1-3
1.2 Course Flow ... 1-6

1.1 Introduction

Objectives
• Provide an overview of DataFlux.
• Explore the DataFlux Data Management methodology.

Introduction to DataFlux

DataFlux is a leading provider of data management solutions.
• Founded in 1997
• Acquired by SAS in 2000
• Operates as a wholly owned subsidiary of SAS

SAS is the world's largest privately held software company.
• Market leader in analytics and business intelligence solutions
• DataFlux technology is embedded in several SAS applications.

DataFlux Data Management Methodology

[Diagram: the methodology cycle — PLAN (DEFINE, DISCOVER), ACT (DESIGN, EXECUTE), and MONITOR (EVALUATE, CONTROL)]

The DataFlux Data Management Methodology is a step-by-step process for performing data management tasks such as data quality, data integration, data migration, and MDM (master data management). When organizations plan, take action on, and monitor data management projects, they build the foundation to optimize revenue, control costs, and mitigate risks. No matter what type of data you manage, DataFlux technology can help you gain a more complete view of corporate information.

PLAN

DEFINE  The planning stage of any data management project starts with this essential first step. This is where the people, processes, technologies, and data sources are defined. Roadmaps are built that articulate the acceptable outcomes. Finally, cross-functional teams that span business units and bridge the business and IT communities are created to define the data management business rules.

DISCOVER  A quick inspection of your corporate data would probably find that it resides in many different databases. The data is managed by many different systems, with many different formats and representations of the same data. This step of the methodology enables you to explore metadata to verify that the right data sources are included in the data management program. It also enables you to create detailed data profiles of identified data sources to understand their strengths and weaknesses.

ACT

DESIGN  After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds, and create an environment that accommodates the needs of your business. At this step, business and IT users build jobs to enforce business rules for data quality and data integration. They create data models to house data in consolidated or master data sources.

EXECUTE  After business users establish how the data and rules should be defined, the IT staff can install them in the IT infrastructure and determine the integration method – real time, batch, or virtual. These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise.

MONITOR

EVALUATE  This step of the methodology enables users to define and enforce business rules to measure the consistency, accuracy, and reliability of new data as it enters the enterprise. Reports and dashboards about critical data metrics are created for business and IT staff members. The information gained from data monitoring reports is used to refine and adjust the business rules.

CONTROL  The final stage in a data management project involves examining any trends to validate the extended use and retention of the data. Data that is no longer useful is retired. The project's success can then be shared throughout the organization. The next steps are communicated to the data management team to lay the groundwork for future data management efforts.
1.2 Course Flow

Objectives
• Review the steps for the course.
• Overview the course data.

Course Tasks – Step 1

Step 1: DataFlux Data Management Studio Basics
• Explore the DataFlux Data Management Studio Interface
• Create Course Repositories
• Define Data Connections

This step involves becoming familiar with the DataFlux Data Management Studio interface. As part of the exploration, you define a repository and verify the course QKB (Quality Knowledge Base). In addition, a new data connection is defined.

Course Tasks – Step 2

Step 2: PLAN: Connect to, explore, and profile data; design standardization schemes.
• Explore data values in a data profile
• Explore data information in a field relationship map
• Build a scheme

The PLAN phase of the DataFlux Data Management Methodology involves exploring the structural information in the data sources and analyzing the data values of the data sources. As a result of that analysis, this phase builds any needed schemes.

Course Tasks – Step 3

Step 3: ACT: Build data jobs for quality and entity resolution.

The analysis and exploration of the data sources can lead to the discovery of data quality issues. The ACT phase is designed to create the data jobs that cleanse or correct the data. The steps involved do the following:
• Standardize, parse, and/or case the data
• Correctly identify types of data (identification analysis)
• Perform methods to remove duplicates from data sources, or to join tables with no common key

Course Tasks – Step 4

Step 4: MONITOR: Create and implement various types of business rules.

After data corrections are applied, a final phase in the DataFlux Data Management Methodology is to set up processes to monitor various aspects of the data.

Course Data

dfConglomerate has many lines of business (for example, Gifts, Grocery, Clothing, and Auto) and is in need of an enterprise-level view of its data. This course works with two lines of data:
• dfConglomerate Gifts (used in demonstrations)
• dfConglomerate Grocery (used in exercises)

Founded in 1926, dfConglomerate pursued a policy of managed growth and acquisitions to transform itself into an organization with world-wide operations and more than 30,000 employees. Headquartered in Toledo, Ohio, dfConglomerate has offices in London, Frankfurt, Rio de Janeiro, Tokyo, and Bangalore. Because dfConglomerate operates across dozens of different lines of business, the reach of its products and services encompasses everything from heavy mining and manufacturing to automobiles and consumer goods. dfConglomerate prides itself on its decentralized management structure, with each division having a high degree of autonomy over its own operations.

Recently, corporate-level executives realized that the company's history of mergers and acquisitions endowed it with a legacy of data silos, incompatible systems, and bad and duplicate data. It became apparent that the company lacked the essential enterprise-level view of its data. The company also realized that to achieve this view, it needed to improve the overall quality of its data. To do this, dfConglomerate began establishing data management centers of excellence, and used these as pilot programs with the intent of replicating them across the enterprise.

There are five tables in the dfConglomerate Gifts Microsoft Access database.
There are two tables in the dfConglomerate Grocery Microsoft Access database.

Chapter 2  DataFlux Data Management Studio: Getting Started

2.1 Introduction ... 2-3
    Demonstration: Navigating the DataFlux Data Management Studio Interface ... 2-6
    Demonstration: Creating a DataFlux Data Management Studio Repository ... 2-20
    Exercises ... 2-26
2.2 Quality Knowledge Base and Reference Sources ... 2-27
    Demonstration: Verifying the Course QKB and Reference Sources ... 2-31
2.3 Data Connections ... 2-33
    Demonstration: Working with Data Connections ... 2-36
    Exercises ... 2-53
2.4 Solutions to Exercises ... 2-54

2.1 Introduction

Objectives
• Provide an overview of the components of the DataFlux Data Management Platform.
• Define key components and terms of DataFlux Data Management Studio.
• Define a DataFlux Data Management Studio repository.

DataFlux Data Management Platform

[Diagram: the platform — the DataFlux Data Management Studio desktop client (Windows) for PLAN, ACT, and MONITOR tasks; DataFlux Data Management Server for real-time/transactional and bulk/batch processing; DataFlux Federation Server for virtual, on-demand data integration; and DataFlux Authentication Server]

DataFlux Data Management Studio provides a single interface for both business and IT users. Built on DataFlux's technology foundation, DataFlux Data Management Studio provides a unified development and delivery environment, giving organizations a single point of control to manage data quality, data integration, master data management (MDM), or any other enterprise data initiative. DataFlux Data Management Studio enables top-down design of business processes and rules, as well as bottom-up development and delivery of specific components, in a single user environment. Building jobs and business rules in DataFlux Data Management Studio enables transparency in the design process, collaborative development, and change management during and after deployment.

Working in unison with DataFlux Data Management Studio, DataFlux Data Management Server helps form the backbone of the DataFlux Data Management Platform. The technology can implement the rules created in DataFlux Data Management Studio in both batch and real-time environments. Through an innovative, standards-based service-oriented architecture (SOA), DataFlux Data Management Server frees you from writing and testing hundreds of lines of extra code to enable real-time data quality, data integration, and data governance routines. DataFlux Data Management Server is the workhorse that enables pervasive data quality, data integration, and master data management (MDM) throughout your organization.
DataFlux provides integrated views of data located in multiple repositories without the need to physically consolidate the data. This virtual integration enables powerful querying capabilities, as well as improved source data management. Data federation efficiently joins data from many heterogeneous sources without moving or copying the data. The result is improved performance and speed, while reducing costly and labor-intensive data movement and consolidation. DataFlux Federation Server enables organizations to view data from multiple sources through an integrated, virtual data view. While the data remains stored in its source application, multiple users can view integrated data without physically moving or changing it. The data appears as if it were physically integrated in one place, whether accessed as a table, a file, or a Web call.

The DataFlux Authentication Server acts as a central point of security for users of the DataFlux Data Management Platform. By providing authentication services, the Authentication Server helps the users of DataFlux Data Management Studio securely connect to the DataFlux Data Management Server and/or database servers across your enterprise.

DataFlux Data Management Terminology

[Annotated screen shot: the Home tab with its main menu, toolbar, navigation pane, navigation riser bars, and information pane]

[Annotated screen shot: primary tabs, secondary tabs, the secondary toolbar, resource panes, the work area, the details pane, and the Detach Selected Tab control]

Navigating the DataFlux Data Management Studio Interface

In this demonstration, you explore and navigate the interface for DataFlux Data Management Studio.

1. Select Start → All Programs → DataFlux Data Management Studio 2.2. A splash screen appears as DataFlux Data Management Studio is initialized. An interactive Help window for the DataFlux Data Management Methodology also appears by default.

2. Clear the Display Data Management Methodology on Startup check box. (This check box appears at the right of the bottom panel of this interactive Help window.)

   The DataFlux Data Management Methodology Help can be redisplayed in DataFlux Data Management Studio by clicking anyplace in the Data Management Methodology portlet.

3. Click the Close button to close the DataFlux Data Management Methodology Help window.

   The left side of the interface displays navigation information that includes a navigation pane and a navigation riser bars area. The right side of the interface displays the information pane for the selected element of the navigation riser bar. The main menu and a toolbar are also on the Home tab.

4. Verify that the main menu and toolbar appear on the Home tab.

5. Verify that the left side of the interface displays the navigation information (the navigation pane and the navigation riser bars).

6. Verify that the right side of the interface displays the Information pane.

   Notice the following:
   • The Information riser bar is selected in the navigation riser bars area.
   • The Navigation pane displays the items for information.
   • The Overview item is selected (in the Navigation pane).
   • The Information pane displays the overview information.
   • The Overview information area consists of six portlets.

7. Verify the view of the Overview information area with its six portlets:
   • Recent Files
   • Methodology
   • Data Roundtable Discussions
   • Documentation
   • Settings
   • Links

8. Click anyplace in the DataFlux Data Management Methodology portlet. The DataFlux Data Management Methodology window appears.

   a. Click the DEFINE component of the methodology. Information about steps to perform in the Define portion of the methodology appears. There are four main items in the Define portion of the methodology (Connect to Data, Explore Data, Define Business Rules, Build Schemes). Selecting any of these items provides a brief overview and a link to more in-depth help.

   b. Click Explore Data.

   c. Click Click here to learn more. The online Help for DataFlux Data Management Studio appears for the topic of Data Explorations.

   d. Click the Close button to close the Help window.

   e. Click the Close button to close the DataFlux Data Management Methodology Help window.

9. In the Navigation area, click the Data riser bar. The Data riser bar allows a user to work with collections, data connections, and master data foundations.

   Collections folder: A collection is a set of fields that are selected from tables that are accessed from different data connections. A collection provides a convenient way for a user to build a data set using those fields. A collection can be used as an input source for other components in Data Management Studio, such as the Data Viewer, profiles, queries, and data explorations.

   Data Connections folder: Data connections are used to access data in jobs, profiles, data explorations, and data collections.

   Master Data Foundations folder: The Master Data Foundation feature in Data Management Studio uses master data projects and entity definitions to develop the best possible record for a specific resource, such as a customer or a product, from all of the source systems that might contain a reference to that resource.

10. Click Data Connections in the Navigation pane. The information area provides overview information about all defined data connections, such as names, descriptions, and types.

    Different types of data connections can be made in DataFlux Data Management Studio:
    • ODBC connections
    • Federated Server connections
    • Localized DSN connections
    • Custom connections
    • SAS Data Set connections

    ODBC Connection: Displays the Microsoft Windows ODBC Data Source Administrator dialog box, which you can use to create ODBC connections.

    Federated Server Connection: Enables you to create a federated server data connection. A federated connection enables you to establish a threaded connection to a database.

    Localized DSN Connection: Enables you to create a localized DSN connection definition for a specific data source that is available via the federated server. This connection definition is used when you access federated tables. These localized connections are federated server connections to a DBMS that are named and created as an extra connection to the same source in metadata.

    Custom Connection: Enables you to create a custom connection string for non-ODBC connection types. These custom strings are useful for establishing native connections to third-party databases or drawing data from more than one type of data input.

    SAS Data Set Connection: Enables you to create SAS data set connections.

    It is a best practice to save user credentials for your data connections.
11. Click the plus sign to expand the Data Connections item.

12. Click the DataFlux Sample data connection in the Navigation pane. The tables accessible through this data connection are revealed in the information area.

13. In the Navigation area, click the Data Management Servers riser bar. In this Navigation pane, you can define new server instances or import jobs to process on the server.

    a. Click the plus sign in front of Data Management Servers.

    b. Click DataFlux Training Server. The information area displays specifics about this defined server.

14. In the Navigation area, click the Administration riser bar. The Administration navigation pane enables you to manage various items such as repositories, the Quality Knowledge Base (QKB), reference sources (data packs), and macro files.

What Is a Repository?

A Data Management Studio repository is a collection of data and metadata around DataFlux objects.

The DataFlux Data Management Studio repositories are used to do the following:
• Organize work
• Show lineage

The repositories consist of two parts:
• Data Storage
• File Storage

The data storage portion of a repository can contain the following:
• Explorations and reports
• Profiles and reports
• Business rules
• Monitoring results
• Custom metrics
• Business Data information
• Master Data information

The file storage portion of a repository can contain the following:
• Data jobs
• Process jobs
• Match reports
• Entity resolution files
• Queries
• Entity definitions
• Other files

[Screen shot: the Administration riser bar with the Repository Definitions item selected, showing all currently defined repository definitions]

Creating a DataFlux Data Management Studio Repository

This demonstration illustrates the steps necessary to create a DataFlux Data Management Studio repository.

1. If necessary, access the Administration riser bar.

   a. Select Start → All Programs → DataFlux Data Management Studio 2.2.

   b. Verify that the Home tab is selected.

   c. Click the Administration riser bar. The Navigation pane shows administration items when the Administration riser bar is selected.

2. Click Repository Definitions in the list of Administration items on the Navigation pane. The information pane displays overall repository information.

3. Click New to define a new repository.

   a. Type Basics Demos in the Name field.

   b. Click Browse next to the Database file field.

      1) In the Open window, navigate to S:\Workshop\dqdmp1.
      2) Click New folder.
      3) Type Demos as the value for the new folder name.
      4) Press ENTER.
      5) Double-click the new folder (Demos).
      6) Type Demos.rps in the File name field.
      7) Click Open.

   c. Click Browse next to the Folder field in the File storage area.

      1) Navigate to S:\Workshop\dqdmp1\Demos.
      2) Click Make New Folder.
      3) Type files as the value for the name of the new folder.
      4) Press ENTER.
      5) Click OK.

   d. Clear Private. The final settings for the new repository definition should resemble the following:

      Name: Basics Demos
      Database file: S:\Workshop\dqdmp1\Demos\Demos.rps
      File storage folder: S:\Workshop\dqdmp1\Demos\files

   e. Click OK. A message window appears and states that the repository does not exist.

   f. Click Yes. Information regarding the repository initialization appears in a window.

   g. Click Close. The repository is created and connected.
4. Update the set of default folders for this new repository.

   a. Verify that the Home tab is selected.

   b. Click the Folders riser bar.

   c. Click the plus sign to expand the Basics Demos repository.

   d. Select File → New → Folder.

      1) Type output_files.
      2) Press ENTER.

   e. Click the Basics Demos repository.

   f. Select File → New → Folder.

      1) Type profiles_and_explorations.
      2) Press ENTER.

   It is a best practice to use all lowercase and no spaces for folders that might be used on the DataFlux Data Management Server, because some server operating systems are case sensitive. Because profiles and explorations run on a DataFlux Data Management Server, you want to adhere to this best practice.

   The final set of folders for the Basics Demos repository should resemble the following:

   Basics Demos
   • output_files
   • profiles_and_explorations

Exercises

1. Creating a Repository for Upcoming Exercises

   Create a new repository to be used for the items created in the upcoming exercises. Some specifics for the new repository follow:
   • Name the repository Basics Exercises.
   • Create a database file repository named S:\Workshop\dqdmp1\Exercises\Exercises.rps.
   • Specify a file storage location of S:\Workshop\dqdmp1\Exercises\files.

2. Updating the Set of Default Folders for the Basics Exercises Repository

   Create two additional folders for the new Basics Exercises repository:
   • output_files
   • profiles_and_explorations

   The final set of folders should resemble the following:

   Basics Exercises
   • output_files
   • profiles_and_explorations

2.2 Quality Knowledge Base and Reference Sources

Objectives
• Define a Quality Knowledge Base (QKB).
• Define reference sources.

What Is a Quality Knowledge Base (QKB)?

A Quality Knowledge Base is a collection of files that perform the "cleansing" of data, which can involve parsing, standardization, matching, and more.

[Figure: DataFlux locale support map]

The Quality Knowledge Base (QKB) is a collection of files and configuration settings that contain all the DataFlux data management algorithms. DataFlux is Unicode compliant, which means that it can read data from any language. However, DataFlux only has definitions in the Quality Knowledge Base (QKB) to standardize, identify, fuzzy match, and so on for data from particular locales.

Each QKB supports data management operations for a specific business area. The QKB for Contact Information supports management of commonly used contact information for individuals and organizations — for example, names, addresses, company names, and phone numbers. The supported locales for the QKB for Contact Information are indicated by dark blue (or black) in the locale support map. Locales are continuously added, so contact your DataFlux representative for future locale support.

In addition, the DataFlux Data Management Studio application can perform cleansing functions for non-contact information data as well. Depending on the activities you want to perform, you might need to create your own definitions in the DataFlux Quality Knowledge Base using the Data Management Studio – Customize module. This module enables you to modify existing definitions or create new definitions in the QKB. (The Data Management Studio – Customize module has its own training course.)

Quality Knowledge Base locations are registered on the Administration riser bar in DataFlux Data Management Studio. One QKB location needs to be designated as the default QKB location.
[Screen shot: the Administration riser bar with a specific QKB selected, showing summary information for the selected QKB]

What Is a Reference Source?

A reference source is an external source of information. A reference object is typically a database used to compare user data.

Current valid reference data includes the following:
• USPS data
• Canada Post data
• Geo+Phone data
• World data

A reference object is typically a database used by DataFlux Data Management Studio to compare user data to a reference source (for example, USPS Address Data). You cannot directly access or modify references.

Reference source locations are registered on the Administration riser bar in DataFlux Data Management Studio. One reference source location of each type should be designated as the default for that type.

[Screen shot: the Administration riser bar with the Reference Sources item selected, showing summary information for reference sources]

Verifying the Course QKB and Reference Sources

This demonstration illustrates the verification of the course QKB, as well as the reference sources that are defined on the course image.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.

2. Verify that the Home tab is selected.

3. Click the Administration riser bar.

4. Click the plus sign to expand Quality Knowledge Bases.

5. Click the DataFlux QKB CI 2011A quality knowledge base. The information area displays a summary of information.

6. Click the plus sign to expand Reference Sources. Two reference sources are defined.

7. Click the USPS Data item. The information area displays a summary of information.

8. Click the Geo+Phone Data item. The information area displays a summary of information.

2.3 Data Connections

Objectives
• Explore data connections.

What Is a Data Connection?

A data connection enables access to data in DataFlux Data Management Studio.

DataFlux Data Management Studio enables five types of data connections:
• ODBC Connection
• Federated Server Connection
• Localized DSN Connection
• Custom Connection
• SAS Data Set Connection

Data connections enable you to access your data in Data Management Studio from many types of data sources.

• ODBC Connection – Displays the Microsoft Windows ODBC Data Source Administrator dialog box, which you can use to create ODBC connections.

• Federated Server Connection – Enables you to create a federated server data connection. A federated connection enables you to establish a threaded connection to a database.

• Localized DSN Connection – Enables you to create a localized DSN connection definition for a specific data source that is available via the federated server. This connection definition is used when you access federated tables. These localized connections are federated server connections to a DBMS. They are named and created as an extra connection to the same source in metadata.

• Custom Connection – Enables you to create a custom connection string for non-ODBC connection types. These custom strings are useful for establishing native connections to third-party databases or drawing data from more than one type of data input.

• SAS Data Set Connection – Enables you to create SAS data set connections.
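For readers unfamiliar with connection strings, the following is a minimal illustrative sketch only, not part of the course flow. It shows the general key=value form that connection strings take, using ODBC-style keywords and the course server and credentials from the demonstration that follows; an actual Custom Connection in Data Management Studio uses a native (non-ODBC) string whose exact keywords depend on the target database driver.

    A DSN-less, ODBC-style string (keywords are standard ODBC, not DataFlux-specific):
        Driver={SQL Server};Server=SASBAP\SQLEXPRESS;Uid=sa;Pwd=sas;

    A DSN-based equivalent, once the DataFlux Training DSN has been created:
        DSN=DataFlux Training;Uid=sa;Pwd=sas;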
[Screen shot: the Data riser bar with the Data Connections item selected, showing summary information for data connections]

[Screen shot: the Data Viewer]

Working with Data Connections

This demonstration illustrates how to define and work with a data connection, including using the Data Viewer and generating queries.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.

2. Verify that the Home tab is selected.

3. Click the Data riser bar.

4. Click the plus sign to expand Data Connections.

5. Define a new data connection.

   a. Click the Data Connections item in the navigation area.
   b. In the information area for the selected item, click New Data Connection.
   c. Click ODBC Connection. The ODBC Data Source Administrator window appears.
   d. Click the System DSN tab.
   e. Click Add.
   f. Click the SQL Server driver.
   g. Click Finish to close the Create New Data Source window. The Create a New Data Source to SQL Server window appears.
   h. Type DataFlux Training in the Name field.
   i. Select SASBAP\SQLEXPRESS as the value for the Server field.
   j. Click Next.

      1) Click With SQL Server authentication using a login ID and password entered by the user.
      2) Type sa in the Login ID field.
      3) Type sas in the Password field.

   k. Click Next. No changes are needed.
   l. Click Next. No changes are needed.
   m. Click Finish. A summary of the settings appears.
   n. Click Test Data Source.
   o. Click OK to close the test window.
   p. Click OK to close the setup window. The new data source appears on the System DSN tab of the ODBC Data Source Administrator window.
   q. Click OK to close the ODBC Data Source Administrator window. The new data source appears in the Data Connections navigation area.

6. Investigate the new data connection named DataFlux Training.

   a. Click the plus sign to expand DataFlux Training.

      The dbo, INFORMATION_SCHEMA, and sys schemas are structures created and maintained by SQL Server. Only valid and usable data sources should be shown in DataFlux Data Management Studio.

   b. Right-click DataFlux Training and select Filter.

      1) Click Local filter.
      2) Click Schema name.
      3) Type df_gifts, df_grocery in the Name field.
      4) Click OK.

      Only the two valid, usable data sources are displayed.

7. If necessary, save the connection information for the two dfConglomerate data connections.

   a. Click the Data Connections item in the navigation pane. The information area displays information for all data connections.
   b. Right-click dfConglomerate Gifts and select Save User Credentials.
   c. Right-click dfConglomerate Grocery and select Save User Credentials.

   This action saves the data source connection information. If the data source connection requires a user name and password, you are prompted for that information. It is a recommended best practice to save connection information.

8. Use the Data Viewer to explore data.

   a. In the Data Connections navigation area, click the plus sign to expand the dfConglomerate Gifts data connection.
   b. Click the Customers table object. The information area is populated with information about the selected table.
   c. Click the Data tab.
   d. Click the Sort tool on the Data tab. A Sort window appears.
      1) Click the JOB TITLE field in the Available Fields pane.
      2) Click the right arrow to move the JOB TITLE field to the Selected Fields pane.
      3) Select Descending in the Direction field.
      4) Click OK.

      The data is now displayed in sorted order (by descending JOB TITLE). A symbol next to the column heading indicates that a sort occurred and shows the sort direction.

9. Click the Fields tab. This tab can be used to examine the information displayed for each field in the selected table.

10. Create a stand-alone query using the Customers table.

    a. In the Navigation area, right-click the Customers table and select Query Table.
    b. Type Ch2D4_CustomersMissingEmail_Query in the Name field.
    c. If necessary, navigate to Basics Demos → profiles_and_explorations in the Save in field.
    d. Click OK. The new query appears on a tab.

    The query can be further customized in this interface, which includes the Query Flow tab, the Nodes and Data riser bars, the query flow editor, the SQL navigation panel, and the details pane.

11. Change the orientation of the flow by clicking Left To Right.

12. Add a WHERE clause to the generated SQL query.

    a. Locate the Nodes riser bar.
    b. Double-click the Where keyword under the Clauses grouping.
    c. Right-click the newly added Where node and select Properties.
    d. Click in the first Operand field and select Select Field. The Select Field window appears.

       1) Click the plus sign to expand the Customers table.
       2) Click EMAIL. The Select Field window displays a sample of the selected field's values.
       3) Click OK.

    e. Click in the Operator field and select IS NULL.

13. Select File → Save to save the query.

14. Select Actions → Preview to preview the result set. The Details pane displays the Preview tab.

15. Click the Code tab on the Details pane to view the generated SQL code.

16. Select View → Show Details Pane to toggle off the Details pane.

17. Click the Code sub-tab.

18. Select View → Editor Type → Code. A Warning window appears.

19. Click Yes. The editor changes from the Visual style to the Code style.

20. Click the plus sign to expand dfConglomerate Gifts.

21. Click the Customers table.

22. Click the Functions riser bar. There are many types of available functions that can be used to customize the SQL query further.

23. Select File → Close to close the query.

Exercises

3. Viewing and Graphing a Database Table

   Access the BREAKFAST_ITEMS table in the dfConglomerate Grocery data connection.

   • View the data on the Data tab and perform a sort by the columns BRAND and NAME.
   • Review the field attributes on the Fields tab.

     QUESTION: How many fields are in the BREAKFAST_ITEMS table? Answer:
     QUESTION: How many fields are character? Answer:
     QUESTION: How many fields are numeric? Answer:

   • Create a graph using the Graph tab with the following specifications:
     Chart type: Area
     X axis: NAME
     Y axis: SIZE

     QUESTION: What NAME value has the highest value (or sum of values) of SIZE for the sample defined by the default "row count range" of 30? Answer:

2.4 Solutions to Exercises

1. Creating a Repository for Upcoming Exercises

   a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
   b. Verify that the Home tab is selected.
   c. Click the Administration riser bar.
   d. Click Repository Definitions in the list of Administration items on the Navigation pane.
   e. Click New to define a new repository.

      1) Type Basics Exercises in the Name field.
      2) Click Browse next to the Database file field.
         a) In the Open window, navigate to S:\Workshop\dqdmp1.
         b) Click New folder.
         c) Type Exercises as the new folder name.
         d) Press ENTER.
         e) Double-click the new folder (Exercises).
         f) Type Exercises.rps in the File name field.
         g) Click Open.
      3) Click Browse next to the Folder field in the File Storage area.
         a) Navigate to S:\Workshop\dqdmp1\Exercises.
         b) Click Make New Folder.
         c) Type files as the name of the new folder.
         d) Press ENTER.
         e) Click OK.
      4) Clear Private.
      5) Click OK. A message window appears and states that the repository does not exist.
      6) Click Yes. Information regarding the repository initialization appears in a window.
      7) Click Close. The repository is created and connected.

2. Updating the Set of Default Folders for the Basics Exercises Repository

   a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
   b. Verify that the Home tab is selected.
   c. Click the Folders riser bar.
   d. Click the plus sign to expand the Basics Exercises repository.
   e. Select File → New → Folder.
      1) Type output_files.
      2) Press ENTER.
   f. Click the Basics Exercises repository.
   g. Select File → New → Folder.
      1) Type profiles_and_explorations.
      2) Press ENTER.

   The final set of folders for the Basics Exercises repository should resemble the following:

   Basics Exercises
   • output_files
   • profiles_and_explorations

3. Viewing and Graphing a Database Table

   a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
   b. Verify that the Home tab is selected.
   c. Click the Data riser bar.
   d. Click the plus sign to expand the Data Connections item.
   e. Click the plus sign to expand the dfConglomerate Grocery data connection.
   f. Click the BREAKFAST_ITEMS table object. The information area is populated with information about the selected table.
   g. Click the Data tab.
   h. Sort the data by BRAND and NAME.
      1) Click the Sort tool on the Data tab. A Sort window appears.
      2) Click the BRAND field in the Available Fields pane.
      3) Click the right arrow to move the BRAND field to the Selected Fields pane.
      4) Click the NAME field in the Available Fields pane.
      5) Click the right arrow to move the NAME field to the Selected Fields pane.
      6) Click OK.

      The data is displayed in sorted order (by BRAND and then by NAME). The 1 designates a primary ascending sort. The 2 designates a secondary ascending sort.

   i. Click the Fields tab.

      QUESTION: How many fields are there in the BREAKFAST_ITEMS table? Answer: Fifteen (15)
      QUESTION: How many fields are character? Answer: Seven (7)
      QUESTION: How many fields are numeric? Answer: Eight (8)

   j. Click the Graph tab.
      1) Click in the Chart type field and select Area.
      2) Click in the X-axis field and select NAME.
      3) Click in the Y-axis field and select SIZE.

      QUESTION: What NAME value has the highest value (or sum of values) of SIZE for the sample defined by the default "row count range" of 30? Answer: MALT-O-MEAL HONEY BUZZERS (21.8)

Chapter 3  PLAN

3.1 Creating Data Collections ... 3-3
    Demonstration: Creating Collections of Address and Note Fields ... 3-5
    Exercises ... 3-13
3.2 Designing Data Explorations ... 3-14
    Demonstration: Creating and Exploring a Data Exploration ... 3-18
3.3 Creating Data Profiles ... 3-32
    Demonstration: Creating and Exploring a Data Profile ... 3-39
    Exercises ... 3-53
    Demonstration: Profiling Data Using Text File Input ... 3-56
    Demonstration: Profiling Data Using Filtered Data and an SQL Query ... 3-61
    Demonstration: Profiling Data Using a Collection ... 3-68
    Exercises ... 3-70
    Demonstration: Data Profiling – Additional Analysis (Self-Study) ... 3-73
3.4 Designing Data Standardization Schemes ... 3-83
    Demonstration: Creating a Phrase Standardization Scheme ... 3-85
    Demonstration: Creating an Element Standardization Scheme ... 3-92
    Demonstration: Importing a Scheme from a Text File ... 3-96
    Exercises ... 3-98
3.5 Solutions to Exercises ... 3-99

3.1 Creating Data Collections

Objectives
• Define a data collection.

What Is a Data Collection?

A data collection is a set of data fields from different tables of different data connections. A data collection provides a convenient way to build a data source using desired fields, and it can be used as an input source for the Data Viewer, explorations, profiles, and queries.

[Screen shot: the Data riser bar with a collection selected, showing two collections in the Basics Demos repository]

You can use data collections to group data fields from different tables and/or database connections. These collections can be used as input sources for data explorations and profiles.

Creating Collections of Address and Note Fields

This demonstration illustrates the steps necessary to create a data collection. Two collections are built: one with address information and one with notes.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.

2. Verify that the Home tab is selected.

3. Click the Data riser bar.

4. Click the plus sign to expand the Collections grouping.

5. Click the Basics Demos repository.

6. Create a new collection of address information.

   a. Click the New Collection tool (or right-click Basics Demos and select New Collection).
   b. Type Address Info in the Name field.
   c. Type Collection of Address Fields in the Description field.
   d. Click OK. The new collection appears on a tab.
7. Insert various address fields.

   a. Click the Insert Fields tool. The Insert Fields window appears.

   b. Add address fields from the dfConglomerate Gifts data connection.

      1) Click the down arrow next to the Connection field.
      2) Click dfConglomerate Gifts.
      3) Click OK.
      4) Add fields from the Customers table.
         a) Click the plus sign in front of the Customers table.
         b) Click the ADDRESS field. (Click the check box.)
         c) Click the ADDRESS field name. A data sample of the ADDRESS field appears on the right of the Insert Fields window.
         d) Click the CITY field. (Click the check box.)
         e) Click the COUNTRY/REGION field. (Click the check box.)
         f) Click the STATE/PROVINCE field. (Click the check box.)
         g) Click the ZIP/POSTAL CODE field. (Click the check box.)

            All the desired fields from the Customers table are selected.

         h) Click Add.
         i) Clear the selections for the Customers table.
      5) Add fields from the Employees table.
         a) Click the plus sign in front of the Employees table.
         b) Click the ADDRESS field. (Click the check box.)
         c) Click the CITY field. (Click the check box.)
         d) Click the COUNTRY/REGION field. (Click the check box.)
         e) Click the STATE/PROVINCE field. (Click the check box.)
         f) Click the ZIP/POSTAL CODE field. (Click the check box.)
         g) Click Add.
         h) Clear the selections for the Employees table.
      6) Add fields from the Orders table.
         a) Click the plus sign in front of the Orders table.
         b) Click the SHIP ADDRESS field. (Click the check box.)
         c) Click the SHIP CITY field. (Click the check box.)
         d) Click the SHIP COUNTRY/REGION field. (Click the check box.)
         e) Click the SHIP STATE/PROVINCE field. (Click the check box.)
         f) Click the SHIP ZIP/POSTAL CODE field. (Click the check box.)
         g) Click Add.
         h) Clear the selections for the Orders table.

   c. Add address fields from the dfConglomerate Grocery data connection.

      1) Click the down arrow next to the Connection field.
      2) Click dfConglomerate Grocery.
      3) Click OK.
      4) Add fields from the MANUFACTURERS table.
         a) Click the plus sign in front of the MANUFACTURERS table.
         b) Click the CITY field. (Click the check box.)
         c) Click the CONTACT_ADDRESS field. (Click the check box.)
         d) Click the CONTACT_CITY field. (Click the check box.)
         e) Click the CONTACT_CNTRY field. (Click the check box.)
         f) Click the CONTACT_POSTAL_CD field. (Click the check box.)
         g) Click the CONTACT_STATE_PROV field. (Click the check box.)
         h) Click the COUNTRY field. (Click the check box.)
         i) Click the POSTAL_CD field. (Click the check box.)
         j) Click the STATE_PROV field. (Click the check box.)
         k) Click the STREET_ADDR field. (Click the check box.)
         l) Click Add.
      5) Click Close to close the Insert Fields window.

   The Address Info collection should now resemble the following:

   [Screen shot: the Address Info collection listing the selected address fields]

8. Select File → Close Collection.

9. Create a new collection of note information.

   a. Click the Basics Demos repository.
   b. Click the New Collection tool (or right-click Basics Demos and select New Collection).
   c. Type Descriptions in the Name field.
   d. Type Collection of Descriptive Fields in the Description field.
   e. Click OK. The new collection appears on a tab.

10. Insert various notes fields.

    a. Click the Insert Fields tool (or click Insert Fields in the information pane for the newly defined Descriptions collection). The Insert Fields window appears.

    b. Add notes fields from the dfConglomerate Gifts data connection.

       1) Click the down arrow next to the Connection field.
       2) Click dfConglomerate Gifts.
       3) Click OK.
       4) Add fields from the Customers table.
          a) Click the plus sign in front of the Customers table.
          b) Click the NOTES field. (Click the check box.)
          c) Click Add.
          d) Clear the selections for the Customers table.
       5) Add fields from the Employees table.
          a) Click the plus sign in front of the Employees table.
          b) Click the NOTES field. (Click the check box.)
          c) Click Add.
          d) Clear the selections for the Employees table.
       6) Add fields from the Orders table.
          a) Click the plus sign in front of the Orders table.
          b) Click the NOTES field. (Click the check box.)
          c) Click Add.
          d) Clear the selections for the Orders table.

    c. Add notes fields from the dfConglomerate Grocery data connection.

       1) Click the down arrow next to the Connection field.
       2) Click dfConglomerate Grocery.
       3) Click OK.
       4) Add fields from the MANUFACTURERS table.
          a) Click the plus sign in front of the MANUFACTURERS table.
          b) Click the NOTES field. (Click the check box.)
          c) Click Add.
       5) Click Close to close the Insert Fields window.

    The Descriptions collection should now resemble the following:

    [Screen shot: the Descriptions collection listing the selected NOTES fields]

11. Select File → Close Collection. The two collections are displayed for the Basics Demos repository.

Exercises

1. Creating a Dates Collection

   Create a collection named Dates and add the date fields from all the tables in the dfConglomerate Gifts and dfConglomerate Grocery data connections. Add the collection to the Basics Exercises repository.

   QUESTION: How many date fields were added to the collection? Answer:

2. Creating a Phone Numbers Collection

   Create a collection named Phone Numbers and add the various phone number fields from all the tables in the dfConglomerate Gifts and dfConglomerate Grocery data connections. Add the collection to the Basics Exercises repository.

   QUESTION: How many phone number fields were added to the collection? Answer:

3.2 Designing Data Explorations

Objectives
• Define and investigate data explorations.

What Is a Data Exploration?

A data exploration reads data from databases and sorts the fields in the selected tables into categories. Data explorations enable you and your organization to identify data redundancies, and to extract and organize metadata from multiple sources. Relationships between metadata can be identified and cataloged into types of data by specified business data types and processes.

The categories that a data exploration uses are predefined in the Quality Knowledge Base (QKB). Data explorations perform this categorization by matching column names. You also have the option of sampling the data in the table to determine whether the data belongs to one of the specific categories in the QKB. For example, your customer metadata might be grouped into one catalog and your address metadata might be grouped in another catalog.

After you organize your metadata into manageable chunks, you can identify relationships between the metadata by table-level profiling. Creating a data exploration enables you to analyze tables in databases to locate potential matches and plan for the profiles that you need to run. After you identify possible matches, you can plan the best way to handle your data and create a profile job for any database, table, or field. Thus, you can use a data exploration of your metadata to decide on the most efficient and profitable way to profile your physical data.
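Although Studio performs this metadata matching for you, it can help to see the raw material that field-name matching works from. The following SQL is a hypothetical sketch, not part of the course flow: it assumes the SQL Server source defined in Chapter 2 and lists the column names across the two course schemas. Scanning such a list for similarly named columns is a manual approximation of what a data exploration automates.

    SELECT TABLE_SCHEMA,   /* schema, for example df_gifts or df_grocery */
           TABLE_NAME,     /* table that owns the column                 */
           COLUMN_NAME,    /* candidate field name to match              */
           DATA_TYPE
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA IN ('df_gifts', 'df_grocery')
    ORDER BY COLUMN_NAME, TABLE_SCHEMA, TABLE_NAME;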
Data Exploration Map

The Data Exploration map is used to quickly review the relationships between all of the databases, tables, and fields that are included in a Data Exploration project.

Data Exploration Report – Field Match

The Field Match report displays a list of the database fields that match a selected field.

Data Exploration Report – ID Analysis

The Identification Analysis report lists the database fields that match categories in the definitions that you selected for field name and sample data analysis.

Data Exploration Report – Table Match

The Table Match report displays a list of database tables that match a selected table or field.

Creating and Exploring a Data Exploration

This demonstration illustrates creating a data exploration, and then examining the resultant report.

1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. If necessary, click to expand the Basics Demos repository.
5. Right-click the profiles_and_explorations folder and select New Data Exploration.
   a. Type Ch3D2_dfConglomerate_Data_Exploration in the Name field.
   b. Click OK.
   The new data exploration appears on a tab.
6. Verify that the Properties tab is selected.
7. Define the data exploration specifics.
   a. In the Data Sources pane, click Add Table. The Add Tables window appears.
   b. Click dfConglomerate Gifts.
   c. Click dfConglomerate Grocery.
   d. Click Add.
   e. Click Close to close the Add Tables window.
   f. Verify that seven tables appear in the Data sources area.
   g. In the Analysis Methods pane, click Field name matching.
      1) Click once in the Locale field. This reveals a selection tool.
      2) Click in the Locale field and select English (United States).
      3) Click once in the Match Definition field. This reveals a selection tool.
      4) Click in the Match Definition field and select Field Name.
      5) Click once in the Sensitivity field. This reveals a selection tool.
      6) Click in the Sensitivity field and select 75.
   h. In the Analysis Methods pane, click Field name analysis.
      1) Click once in the Locale field. This reveals a selection tool.
      2) Click in the Locale field and select English (United States).
      3) Click once in the Identification Definition field. This reveals a selection tool.
      4) Click in the Identification Definition field and select Field Name.
   i. In the Analysis Methods pane, click Sample data analysis.
      1) Specify 500 as the value for the Sample size (records) field.
      2) Click once in the Locale field. This reveals a selection tool.
      3) Click in the Locale field and select English (United States).
      4) Click once in the Identification Definition field. This reveals a selection tool.
      5) Click in the Identification Definition field and select Contact Info.
8. Select Actions Run (or click ) to execute the data exploration. The status of the processing appears:
9. Explore the Data Exploration map results.
   a. Click the Map tab. The Field Relationship map appears:
   The outer ring represents the selected data connections. The inner ring represents a table from a data connection. If you move your cursor over a segment of the inner ring, the table name appears. The dots represent the fields.
   b. Locate the MANUFACTURERS table from the dfConglomerate Grocery data connection.
   c. Locate the dot representing the STREET_ADDR field.
   d. Click the dot representing the STREET_ADDR field. A sample of data values appears in the panel to the right. Four other fields are related to this STREET_ADDR field.
   QUESTION: What are the names of the four other fields?
   QUESTION: What tables do the four fields appear in?
   Answer:
   ADDRESS from the Customers table in the dfConglomerate Gifts data source
   ADDRESS from the Employees table in the dfConglomerate Gifts data source
   SHIP ADDRESS from the Orders table in the dfConglomerate Gifts data source
   CONTACT_ADDRESS from the MANUFACTURERS table in the dfConglomerate Grocery data source
10. Explore the Data Exploration report information.
   a. Click the Report tab.
   b. Click in front of All Databases.
   c. Double-click the dfConglomerate Grocery data connection.
   d. Click the MANUFACTURERS table.
   e. Double-click the MANUFACTURERS table.
   f. Double-click the STREET_ADDR field. The matching fields' information appears.
   g. Click the Identification Analysis riser bar.
      1) Click in front of All Identification Definitions.
      2) Click in front of Field Name.
      3) Click the category of ADDRESS. The fields identified as ADDRESS using the metadata identification definition are displayed.
         a) Select all fields.
            (1) Click the first field.
            (2) Hold down the SHIFT key.
            (3) Click the last field.
      4) Right-click one of the selected fields and select Add To Collection Address Info.
      5) Click Collections in the resource pane.
      6) If necessary, click in front of Basics Demos.
      7) Click the Address Info collection.
      8) Click (Open tool).
      9) Verify that the "address" fields are part of this collection.
      10) Select File Close Collection.
   h. Click the Table Match riser bar.
      1) Click in front of All Databases.
      2) Click in front of dfConglomerate Grocery.
      3) Click the MANUFACTURERS table.
      4) Click the Customers table. The possible relationships between MANUFACTURERS and Customers are displayed.
11. Select File Close Exploration. The tab for the data exploration is closed.

3.3 Creating Data Profiles

Objectives
Define and investigate data profiles.

What Is a Data Profile?
A data profile provides the ability to inspect data for errors, inconsistencies, redundancies, and incomplete information. When data are profiled, the following tasks can be performed:
• Improve understanding of existing databases.
• Identify issues early in the data management process, when they are easier and less expensive to manage.
• Determine which steps need to be taken to address data problems.

Profile Properties
(The slide shows a profile's Properties tab, with callouts for identifying the tables within data connections to profile, identifying the columns from a selected table to profile, metric overrides, redundant data analysis, PK/FK analysis, and alerts.)

Data Profiling Metrics
Selecting Tools Default Profile Metrics (from the Properties tab) enables the selection of metrics to be calculated for all fields. Selecting a particular field (from the Properties tab), and then selecting Actions Override Metrics enables the default metrics to be overridden for that field.
The first four metric selections take additional processing time to calculate, so be judicious in their selection.
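To see why, it helps to contrast a single aggregate metric with a full distribution. The following pandas sketch is illustrative only (it is not how Studio computes its metrics):

```python
# Illustration (not DataFlux internals) of why the first four selections
# cost more: standard metrics are single aggregates, while a frequency
# distribution must materialize every distinct value in the field.
import pandas as pd

state = pd.Series(["NC", "nc", "NC", None, "N. Carolina"], name="STATE")

# Cheap, single-number metrics:
print("record count:", len(state))
print("null count:", int(state.isna().sum()))
print("unique count:", state.nunique(dropna=True))

# Costlier: the full frequency distribution of the values.
print(state.value_counts())
```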
Data Profile Report – Standard Metrics

Standard metrics include information such as the following:
• Record count
• Null count
• Blank count
• Non-null count
• Calculations such as mean, median, mode, standard deviation, standard error
• Min/max value
• Pattern count
• Primary key candidate
• Nullable/percent null
• Decimal places

Data Profile Report – Visualizations
Visualizations are customized charts that you create based on your data and the metrics that you apply.

Data Profile Report – Notes
Notes can be added to your report to aid in your planning process.

Data Profile Report – Frequency Distributions
Distributions can be calculated for the frequency counts of the actual values of a field. To view the record(s) where a particular distribution value occurs, double-click the value. The list of distribution values can also be filtered and visualized.

Data Profile Report – Pattern Frequencies
Distributions can be calculated for the pattern frequency counts (the pattern of the value in the field). To view the record(s) where a particular pattern occurs, double-click the pattern distribution value. The list of pattern distribution values can also be filtered and visualized. For a pattern, the following rules apply:
• "A" represents an uppercase letter.
• "a" represents a lowercase letter.
• "9" represents a digit.

Data Profile Report – Percentiles
Percentiles provide a numeric layout of your data at a percentile interval that you specify. The percentile interval is specified when you set your metrics.

Data Profile Report – Outliers
The Outliers tab lists the top X minimum and maximum value outliers. The number of listed minimum and maximum values is specified when you set your metrics.

Creating and Exploring a Data Profile

This demonstration illustrates the basics of setting the properties for a data profile, running the data profile, and then viewing various aspects of the data profile report.

1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. If necessary, click to expand the Basics Demos repository.
4. Right-click the profiles_and_explorations folder and select New Profile.
   a. Type Ch3D3_dfConglomerate_Profile in the Name field.
   b. Click OK.
   The new profile appears on a tab.
5. Define the profile properties.
   a. Verify that the Properties tab is selected.
   b. Click the dfConglomerate Gifts check box. An X appears in the check box.
   c. Click the dfConglomerate Grocery check box. An X appears in the check box.
   QUESTION: How many tables total are selected?
   Answer: Seven (7) total
   QUESTION: How do you select only some of the fields for a particular table?
   Answer: Click to expand the data connection. Click the desired table, and then click the check box for each of the desired fields.
6. Review profile options.
   a. Select Tools Profile Options.
   b. Review the options on the General tab.
   c. Click the Charts tab.
   d. Review the options on the Charts tab.
   e. Click the Quality Knowledge Base tab.
   f. Review the options on the Quality Knowledge Base tab.
   g. Click Cancel to close the Options window.
7. Define profile options.
   a. Select Tools Default Profile Metrics.
   b. Verify that all metrics are selected.
The first four metric selections take additional processing time to calculate, so be judicious in their selection. c. Click OK to close the Metrics window. 8. Select File Save Profile to save the profile to this point. 3.3 Creating Data Profiles 9. Override metrics for selected columns. a. Click in front of dfConglomerate Grocery. b. Click the MANUFACTURERS table. c. Click the ID field. d. Hold down the CTRL key and click the NOTES field. e. Right-click one of the selected columns and select Override Metrics. f. Clear the selection for Frequency distribution. g. Clear the selection for Pattern frequency distribution. h. Clear the selection for Percentiles. i. Clear the selection for Outliers. j. Click OK to close the Metrics window. The check marks under the M column denote that metrics were overridden for these fields. 10. Select File Save Profile to save the profile. 3-43 3-44 Chapter 3 PLAN 11. Select Actions Run Profile Report. a. Type First profile in the Description field. b. Click OK to close the Run Profile window. The profile is executed. The status of the execution is displayed. The Report tab becomes active. 12. Review the Profile report. a. Click in front of dfConglomerate Gifts. 3.3 Creating Data Profiles 3-45 b. Click the Customers table. Some metrics display (not applicable), which indicates that the metric calculation is not applicable to that field type. 3-46 Chapter 3 PLAN c. Click in front of the Customers table. d. Click the COUNTRY/REGION field. There are 63 values. There are 4 unique values. There are 4 patterns to the values. e. Click the Frequency Distribution tab. 3.3 Creating Data Profiles f. Click the Pattern Frequency Distribution tab. Almost 83% of the data has the pattern of three capital letters. It would be worthwhile to investigate fixing the patterns of the nine observations that do not have the AAA pattern. g. Select Insert New Note. 1) Type Check the patterns of this field. in the Add Note window. 2) Click OK to close the Add Note window. 3) Verify the Notes tab is now active and the new note appears. This is a field-level note. h. Click the Outliers tab. This tab displays the top X minimum and maximum values. 3-47 3-48 Chapter 3 PLAN i. Add a table-level note. 1) Click the Customers table. 2) Select Insert New Note. 3) Type Table level note with reminders on what to check. in the Add Note window. 4) Click OK to close the Add Note window. 5) With the Customers table selected, click the Notes tab. 3.3 Creating Data Profiles 3-49 13. Investigate visualizations. a. With the Customers table selected, click the Visualizations tab. b. Click to the right of the Chart field. c. Type Visual comparison of metrics across fields for Customer table in the Description field. d. Verify that the Chart type is set to Bar. e. Verify that the Data type is set to Field metrics. f. Click Select/unselect all under the Fields pane. (This selects all fields.) g. Click Count, Unique Count, Null Count, Blank Count, and Pattern Count as the choices in the Metrics pane. The final settings should resemble the following: h. Click OK to close the Chart Properties window. 3-50 Chapter 3 PLAN The Chart window appears. The chart can be saved as .jpg, .png, .gif, and .bmp. The chart can be printed (if a valid printer connection is established). 3.3 Creating Data Profiles i. Right-click the background of the chart and select Chart Type. If the desired visualization is not understandable as a bar chart, it can easily be changed by choosing one of the other available types. 
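As a recap of the pattern logic used on the Pattern Frequency Distribution tab, the sketch below shows how a value such as USA yields the AAA pattern. It is a minimal illustration of the documented rule, not the QKB's implementation, and the sample values are invented:

```python
# Minimal sketch of the documented pattern rule (not the QKB code):
# uppercase letter -> "A", lowercase letter -> "a", digit -> "9";
# any other character is kept as-is.
from collections import Counter

def pattern(value):
    return "".join(
        "A" if ch.isupper() else "a" if ch.islower() else "9" if ch.isdigit() else ch
        for ch in value
    )

# Invented sample values, chosen to mirror the report examples.
values = ["USA", "Usa", "U.S.A.", "(919)555-0100"]
print(Counter(pattern(v) for v in values))
# Patterns produced: AAA, Aaa, A.A.A., (999)999-9999
```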
3-51 3-52 Chapter 3 PLAN 14. Investigate a different table that was also profiled in this job. a. Click in front of the dfConglomerate Grocery data connection. b. Click in front of the MANUFACTURERS table object. c. Click the NOTES field. The Column Profiling metrics for the NOTES field is displayed. d. Click the Frequency Distribution tab. Recall that this option was cleared in a metric override for this field. Similarly, the tabs for Pattern Frequency Distribution, Percentiles, and Outliers display (Not calculated) because these options were also cleared in the metric override for this field. 15. Select File Close Profile. 3.3 Creating Data Profiles 3-53 Exercises 3. Profiling the Tables in dfConglomerate Grocery Profile the two tables (MANUFACTURERS and BREAKFAST_ITEMS) in the dfConglomerate Grocery data connection. Some specifics follow: • Use the Basics Exercises repository to store the profile. (Use the profile_and_explorations folder.) Provide a name of Ch3E3_dfConglomerate_Grocery_Profile. • Calculate all metrics on all fields (initial setup). • For the ID fields in both the BREAKFAST_ITEMS and MANUFACTURERS tables, do not calculate frequency distribution, a pattern frequency distribution, percentiles, or outliers. • Run the profile, and provide a description of Initial Profile. • Review the report. QUESTION: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS table? Answer: QUESTION: What are the distinct values for the UOM field in the BREAKFAST_ITEMS table? Answer: QUESTION: What is the ID field value for the records with PK as a value for the UOM field in the BREAKFAST_ITEMS table? Answer: • Filter the list of distinct values to show only those with a frequency count more than 5. Hint: Select View Report Filter. • Add a note to the UOM field. Type Standardize the UOM field. After it is added, verify that the note appears in the UOM field. • View the pattern frequency distribution for the PHONE field in the MANUFACTURERS table. View the records that match the pattern (999)999-9999. Add a note saying to standardize this PHONE field. • Create a line graph to visually explore the comparison between the minimum length and maximum length metrics. Compare these metrics for all fields in the MANUFACTURERS table. 3-54 Chapter 3 PLAN 4. Profiling a Selection of Tables in a DataFlux Sample Profile all fields from four tables (Client_Info, Contacts, Product, and Product_Sales) in the DataFlux Sample data connection. Some specifics follow: • Use the Basics Exercises repository to store the profile. (Use the profile_and_explorations folder.) Provide a name of Ch3E4_DataFluxSample_Profile. • Calculate all column profiling metrics on all fields. (Do not calculate frequency distributions, pattern frequency distribution, percentiles, and outliers on all fields.) • Override the default metrics for the Phone and State fields of the Client_Info table. In addition, calculate frequency distribution and pattern frequency distribution. • Override the default metrics for the PHONE and STATE fields of the Contacts table. In addition, calculate frequency distribution and pattern frequency distribution. • Override the default metrics for the WEIGHT_OZS field of the Product table. In addition, calculate frequency distribution, pattern frequency distribution, percentiles (with default interval), and outliers (with default minimum and maximum specifications). 
• Run the profile and answer the following questions: QUESTION: How many distinct values exist for the STATE field in the Contacts table? Answer: QUESTION: How many distinct patterns exist for the STATE field in the Contacts table? Answer: QUESTION: What does a bar chart graphic (visualization) tell you about the comparison of the data length and maximum length metrics for all fields in the Contacts table? (Do not include the DATE, DELETE_FLG, ID, and MATCH_CD fields in the graphic view.) Answer: QUESTION: What is the ID field value for the record(s) with a “bad” PHONE field pattern in the Contacts table? Add a field level note saying to fix this value. Answer: • Add a table level note saying to check for patterns for the STATE and PHONE fields in the Contacts table. 3.3 Creating Data Profiles Data Profiling – Other Input Types Data profiling can be performed on data connections as well as the following: SQL Queries Text Files Filtered Data Collections 46 A sample interval can be specified for any input type. By default the sample interval is 1. However, another interval can be specified by selecting Actions Change Sample Interval. 3-55 3-56 Chapter 3 PLAN Profiling Data Using Text File Input This demonstration illustrates the steps necessary to use a text file as an input data source for a data profile. 1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears. 2. Click the Folders riser bar. 3. Click the Basics Demos repository. 4. Click Profile. a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of the Save in field.) b. Type Ch3D4_TextFile_Profile in the Name field. c. Click OK. The new profile appears on a tab. 5. On the Properties tab, click Text Files. This action enables the Insert Text File menu choice. 3.3 Creating Data Profiles 6. Select Insert New Text File. a. Type Gift Tradeshow Information in the Table name field. b. Click OK to close the Insert Text File window. 7. Specify delimited file information. a. Click next to the Input file name field. The Open window appears. b. Navigate to S:\Workshop\dqdmp1\data\Text Files. c. Click the file Gift_Tradeshow_List.txt. d. Click Open. The pathname and filename are returned to the Input file name field. e. Click f. in the Text qualifier field and select " (double quotation mark). Verify that the default delimiter is set to Comma. g. Click the Number of rows to skip check box and verify that the default value is set to one. 3-57 3-58 Chapter 3 PLAN h. Click Suggest under the Fields area. 1) Click the First row contains field names check box. 2) Click the Guess field types and lengths from the file content check box. 3) Click OK. The Fields area is populated with information about the fields in the text file. The final settings for the Delimited File Information window should resemble the following: i. Click OK to close the Delimited File Information window. 3.3 Creating Data Profiles 3-59 The field information is returned to the profile. The fields here can be “processed” just as you did for a table source. 8. If necessary, select all fields by clicking (check box) in front of Gift Tradeshow Information. 9. Select all metrics. a. Select Tools Default Profile Metrics. b. Verify that all metrics are selected. c. Click OK to close the Metrics window. 10. Select File Save Profile to save the profile. 3-60 Chapter 3 PLAN 11. Select Actions Run Profile Report. a. Type Profiling a text file in the Description field. b. 
Click OK to close the Run Profile window. The profile executes. The status of the execution is displayed. The Report tab becomes active.
12. Review the profile report.
   a. Click in front of Gift Tradeshow Information to view the available fields.
   b. Click the STATE field. Notice the Pattern Count metric (9).
   c. Click the Pattern Frequency Distribution tab.
   d. Double-click the Aa. pattern. This drill-through action goes back to the text file to get the records.
   e. Click Close to close the drill-through window.
13. Select File Close Profile.

Profiling Data Using Filtered Data and an SQL Query

This demonstration illustrates the steps necessary to access and filter data as input for a data profile. In addition, an SQL query is used as input for a data profile.

1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. Click the Basics Demos repository.
4. Click Profile.
   a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of the Save in field.)
   b. Type Ch3D5_FilterAndSQLQuery_Profile in the Name field.
   c. Click OK.
   The new profile appears on a tab.
5. Set up the SQL query.
   a. On the Properties tab, click in front of the dfConglomerate Gifts data connection.
   b. Click the Products table object.
   c. Select Insert New SQL Query.
   d. Type Miscellaneous Products _SQL_Query_ in the SQL Query name field.
   e. Click OK to close the Insert SQL Query window.
   f. Type a valid SQL query.
      1) Type the following as the value for the SQL Query field:
         Select Products.* from Products where Products.CATEGORY = 'Miscellaneous'
      2) Click Test.
   g. Click OK to close the information window.
   h. Click OK to close the SQL Query window.
   The field information is returned to the profile. The fields here can be "processed" as you did for a table source.
6. Select all fields for the SQL query results by clicking (check box) in front of Miscellaneous Products _SQL_Query_.
7. Set up the filtered data.
   a. On the Properties tab, if necessary, click in front of the dfConglomerate Gifts data connection.
   b. Click the Products table object.
   c. Select Insert New Filtered Table.
   d. Type Miscellaneous Products _Filter_ in the Filter name field.
   e. Click OK to close the Insert Filter window. The Filter on Table window appears.
   f. Specify the filter.
      1) Click in the Field name field and select CATEGORY.
      2) Click in the Operation field and select equal to.
      3) Type Miscellaneous in the Single value field.
      4) Click Add Condition.
      5) Click OK to close the Filter on Table window.
   The field information is returned to the profile. The fields here can be "processed" as you did for a table source.
8. Select all fields for the filter information by clicking (check box) in front of Miscellaneous Products _Filter_.
9. Select all metrics.
   a. Select Tools Default Profile Metrics.
   b. Verify that all metrics are selected.
   c. Click OK to close the Metrics window.
10. Select File Save Profile to save the profile.
11. Select Actions Run Profile Report.
   a. Type Profiling filtered data and query results in the Description field.
   b. Click OK to close the Run Profile window.
   The profile executes. The status of the execution is displayed. The Report tab becomes active.
12. Review the profile report.
   a. Click in front of Miscellaneous Products _SQL_Query_ to view the available fields.
   b.
Click in front of Miscellaneous Products _Filter_ to view the available fields. 3-65 3-66 Chapter 3 PLAN c. Click the CATEGORY field for Miscellaneous Products _SQL_Query_. Notice the Unique Count metric. d. Click the Frequency Distribution tab. e. Click the CATEGORY field for the Miscellaneous Products _Filter_. f. Click the Column Profiling tab. Notice the Unique Count metric. 3.3 Creating Data Profiles g. Click the Frequency Distribution tab. 13. Continue to investigate the profile report (if desired). IMPORTANT: The data generated for both the SQL query and filter have the same results. The filter pulled all records. The filter was processed on the machine that the profile was run on. The database does the filtering for an SQL query. 14. Select File Close Profile. 3-67 3-68 Chapter 3 PLAN Profiling Data Using a Collection This demonstration illustrates the steps necessary to access data from a collection as input for a data profile. 1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears. 2. Click the Data riser bar. 3. Click in front of Collections to view the available collections. 4. Click in front of Basics Demos to view the available collections. 5. Click the Descriptions collection. 6. Click (Profile tool) from the toolbar at the top of the Data navigation pane. a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of the Save in field.) b. Type Ch3D6_Collection_Profile_DESCRIPTIONS in the Name field. c. Click OK. The new profile appears on a tab. 7. Verify that the “description” fields are selected from various data connections. a. Verify that four tables are selected. b. Click the Customers table and verify that the NOTES field is selected. c. Click the Employees table and verify that the NOTES field is selected. 3.3 Creating Data Profiles d. Click the Orders table and verify that the NOTES field is selected. e. Click the MANUFACTURERS table and verify that the NOTES field is selected. 8. Select all metrics. a. Select Tools Default Profile Metrics. b. Verify that all metrics are selected. c. Click OK to close the Metrics window. 9. Select File Save Profile to save the profile. 10. Select Actions Run Profile Report. a. Type Profiling a Collection of NOTES fields in the Description field. b. Click OK to close the Run Profile window. The profile executes. The status of the execution is displayed. The Report tab becomes active. 11. Review the profile report. a. Click in front of dfConglomerate Gifts to view the available tables. b. Click in front of dfConglomerate Grocery to view the available table. c. Click the Customers table and verify that the NOTES field was the only analyzed field. d. Click the Employees table and verify that the NOTES field was the only analyzed field. e. Click the Orders table and verify that the NOTES field was the only analyzed field. f. Click the MANUFACTURERS table and verify that the NOTES field was the only analyzed field. 12. Continue to investigate the profile report (if desired). 13. Select File Close Profile. 3-69 3-70 Chapter 3 PLAN Exercises 5. Profiling a Text File Find more information about the data contained in the Manufacturer_Contact_List.txt file. Some specifics follow: • Use the Basics Exercises repository to store the profile. (Add to the profile_and_explorations folder.) Provide a name of Ch3E5_Manufacturer_TextFile_Profile. 
• File attributes: Table name: Manufacturer_Contacts File Type: Delimited Filename and location: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.txt Text qualifier: Double quotation mark Field delimiter: Comma Number of rows to skip: 1 Use the Suggest feature. (The first row contains field names. Guess field types and lengths from file content.) Use the Export feature (to export the file layout) for this text file. Name the exported file S:\Workshop\dqdmp1\data\Output Files\Manufacturers_Contact_List.dfl. Select all fields for profiling. Select all profiling metrics. Save and run the profile. Fields: • • • • 3.3 Creating Data Profiles 3-71 • Answer the following questions: QUESTION: How many patterns exist for the WORK_STATE field? Answer: QUESTION: How many values exist for the WORK_STATE field? Answer: QUESTION: What values exist for the WORK_STATE field? Answer: QUESTION: What do the records from the WORK_STATE of CA look like? Answer: QUESTION: What can be said about the comparison of the Pattern Count and Unique Count metrics for all fields? (Hint: Create a bar chart to display the comparison.) Answer: 3-72 Chapter 3 PLAN Data Profile Report – Redundant Data Analysis 58 Redundant Data analysis enables you to quickly review whether there is redundant data for selected fields and tables. Finding redundant or duplicate records of data is discussed in a future chapter. Data Profile Report – PK/FK Analysis 59 The Primary Key/Foreign Key analysis can be used to determine the number of common values in two different tables to determine whether it is possible to set a primary key/foreign key relationship and thus create a parent and child table. This analysis is useful in maintaining the referential integrity between tables. 3.3 Creating Data Profiles 3-73 Data Profiling – Additional Analysis (Self-Study) This demonstration investigates how to setup and review a redundant data analysis as well as a primary key / foreign key analysis. 1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears. 2. Click the Folders riser bar. 3. Click in front of Basics Demos to view the available folders. 4. Click in front of profiles_and_explorations to view the available items. 5. Right-click the Ch3D3_dfConglomerate_Profile profile and select Open. 6. Select File Save Profile As to save the profile. a. Type Ch3D7_dfConglomerate_Profile_RDA_PKFK as the value for the Name field. b. Verify the profiles_and_explorations folder is selected. c. Click Save. 7. Click the Properties tab. 8. Establish redundant data analysis. a. Select Actions Analyze Redundant Data. b. Click to the right of the Configurations area. c. Type Ch3D7_Compare_Overlap_of_PhoneNumber_fields in the Name field. d. Click Match code. e. Click in the Match definition field and select Phone. f. in the Sensitivity field and select 95. Click g. Click in front of dfConglomerate Gifts to view the available tables. h. Click in front of the Customers table to view the available fields. i. Double-click BUSINESS PHONE to move this field to the Selected list. j. Double-click HOME PHONE to move this field to the Selected list. k. Double-click MOBILE PHONE to move this field to the Selected list. l. Click in front of the Employees table to view the available fields. 3-74 Chapter 3 PLAN m. Double-click BUSINESS PHONE to move this field to the Selected list. n. Double-click HOME PHONE to move this field to the Selected list. o. 
Double-click MOBILE PHONE to move this field to the Selected list.
   The final settings should resemble the following:
   p. Click OK to close the Redundant Data Analysis Configuration Properties window.
   This instance of redundant data analysis appears in the configurations listing:
   q. Click Close to close the Redundant Data Analysis Configurations window.
9. Verify that the redundant data analysis is specified at the column level.
   a. If necessary, click the Properties tab.
   b. Click in front of dfConglomerate Gifts to view the available tables.
   c. Click the Customers table to view the available fields.
   d. Verify that a check mark appears under the R column for the three phone fields.
   e. Click the Employees table to view the available fields.
   f. Verify that a check mark appears under the R column for the three phone fields.
10. Select File Save Profile to save the profile.
11. Establish Primary Key/Foreign Key analysis.
   a. If necessary, click the Properties tab.
   b. If necessary, click in front of dfConglomerate Gifts to view the available tables.
   c. If necessary, click the Customers table to view the available fields.
   d. Click the ID field.
   e. Select Actions Analyze Primary Key/Foreign Key Relationships.
   f. Click in front of dfConglomerate Gifts to view the available tables.
   g. Click in front of the Orders table to view the available fields.
   h. Click CUSTOMER_ID.
   i. Click Add.
   j. Click OK to close the Primary Key/Foreign Key Analysis window.
12. Verify that the Primary Key/Foreign Key analysis is specified at the column level.
   a. If necessary, click the Properties tab.
   b. Click in front of dfConglomerate Gifts to view the available tables.
   c. Click the Customers table to view the available fields.
   d. Verify that a check mark appears under the P column for the ID field.
13. Select File Save Profile to save the profile.
14. Select Actions Run Profile Report.
   a. Type Performing Redundant Data and PK/FK analyses in the Description field.
   b. Click OK to close the Run Profile window.
   The profile executes. The status of the execution is displayed. The Report tab becomes active.
15. Review the profile report for the Primary Key/Foreign Key analysis.
   a. Click in front of dfConglomerate Gifts to view the available tables.
   b. Click in front of the Customers table to view the list of columns.
   c. Click the ID column.
   d. Click the Primary Key/Foreign Key Analysis tab.
   e. Verify that the match percentage is 100%.
16. Review the Redundant Data analysis.
   a. Click the Redundant Data Analyses riser bar in the Resource pane.
   b. Click the Ch3D7_Compare_Overlap_of_PhoneNumber_fields item. The main area is updated with the analysis results.
   The view contains two portions: the Grid Data Table and the Data Venn Diagram. The Grid Data Table displays the relationships between the fields that were selected for analysis. You can click a cell to see the relationship displayed in the Venn diagram. You can also click the Grid data radio buttons (Left outliers, Common, Right outliers) to control which of the relationships are displayed in the table. The colored dots in the table cells indicate the percentage of matching rows between the two selected fields:
   • Green reflects a low level of redundancy (under 20% by default).
   • Yellow reflects a medium level of redundancy (between 20% and 50% by default).
   • Red reflects a high level of redundancy (more than 50% by default).
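As a minimal sketch of this traffic-light classification, assuming the default 20% and 50% cut-offs described above (the function name and thresholds are invented for illustration; this is not product code):

```python
# A minimal sketch of the traffic-light classification, assuming the
# default cut-offs of 20% and 50%. Illustration only; the real levels
# are configurable in the product.
def redundancy_color(match_pct, medium=20.0, high=50.0):
    """Classify the percentage of matching rows between two fields."""
    if match_pct > high:
        return "red"      # high redundancy
    if match_pct >= medium:
        return "yellow"   # medium redundancy
    return "green"        # low redundancy

for pct in (5.0, 35.0, 80.0):
    print(pct, "->", redundancy_color(pct))   # green, yellow, red
```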
These levels can be changed in the Options dialog box on the Report tab.
   Explanation of the column grid: one set of fields (BUSINESS PHONE, HOME PHONE, MOBILE PHONE) is from the Customers table, and the other set is from the Employees table.
   c. Click in the cell where the BUSINESS PHONE from Customers intersects with the BUSINESS PHONE from Employees.
   d. Double-click the overlapping area. This value for the BUSINESS PHONE was found four times in the Customers table, and once in the Employees table. The information can be exported to a text file or printed.
   e. Click Close.
17. Select File Close Profile.

3.4 Designing Data Standardization Schemes

Objectives
Define and create standardization schemes.

Using an Analysis Report to Build a Scheme
The analysis of an individual field can be counted as a whole (phrase) or based on each one of the field's elements. For example, the field value DataFlux Corporation is treated as two permutations if the analysis is set as Element, but it would be treated as only one permutation if the analysis is set as Phrase. The analysis shown is a phrase analysis of the Company Name field. Similar company names are grouped together based on the match definition and sensitivity selected for the analysis.

What Is a Standardization Scheme?
A standardization scheme takes various spellings or representations of a data value and lists a standard way to consistently write this value. A standardization scheme can be built based on the analysis report. When a scheme is applied, if the input data is equal to the value in the Data column, then the data is changed to the value in the Standard column. The standard of DataFlux was selected by the Scheme Build function because it was the permutation with the most occurrences in the analysis report.
The following special values can be used in the Standard column:
• //Remove: The matched word or phrase is removed from the input string.
• %: The matched word or phrase is not updated. (This is used to show that a word or phrase is explicitly marked for no change.)

Creating a Phrase Standardization Scheme

This demonstration illustrates creating a phrase standardization scheme.

1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. Click in front of Basics Demos to view the available folders.
4. Click in front of profiles_and_explorations to view the available items.
5. Right-click the Ch3D3_dfConglomerate_Profile profile and select Open.
6. Click the Report tab.
7. Click in front of dfConglomerate Gifts to view the available folders.
8. Click the Customers table.
9. If necessary, click the Standard Metrics tab.
10. On the Standard Metrics tab, click the COMPANY field.
11. Select Actions Build a Scheme. The Report Generation window appears.
   a. Under Phrase analysis, click Build.
   b. Click in the Match definition field and select Organization (Scheme Build). Click in the Sensitivity field and select 75.
   c. Click OK to close the Report Generation window.
   The Scheme Builder window appears with the Report side populated with an alphabetical list of the values found in the Frequency Distribution report. The values are grouped with similar names. The group is determined by the selected match definition and sensitivity.
   d. Select Edit Build Scheme.
      1) Verify that English (United States) is selected.
      2) Click OK to close the Select Locale(s) window.
   By default, all values are "moved" to the Scheme side of the Scheme Builder window. The multi-item groups are given a standard value, which is the most frequently occurring value within the grouping on the Report side.
   e. Change the standard value of dfConglomerate Incorporated.
      1) Scroll on the Scheme side and locate the group of records with the standard value of dfConglomerate Incorporated.
      2) Right-click one of the dfConglomerate Incorporated standard values and select Edit.
         a) Type dfConglomerate Inc. in the Standard field.
         b) Click OK.
      Notice that the change applies to all items in the group.
      If a single value in a group of items needs to be changed, then select Edit Modify Standards Manually Single Instance. A single value can then be modified manually. To toggle back to the ability to change all instances in a group, select Edit Modify Standards Manually All Instances.
   f. Change the standard value of Eta Technologies.
      1) Scroll on the Scheme side and locate the group of records with the standard value of Eta Technologies.
      2) Right-click one of the Eta Technologies standard values and select Edit.
         a) Type ETA Computers in the Standard field.
         b) Click OK.
      Notice that the change applies to all items in the group.
   g. Change the standard value of Farmers Insurance Grp Inc.
      1) Scroll on the Scheme side and locate the group of records with the standard value of Farmers Insurance Grp Inc.
      2) Right-click one of the Farmers Insurance Grp Inc standard values and select Edit.
         a) Type Farmers Insurance Group in the Standard field.
         b) Click OK.
   h. Change the standard value of Safeguard Business Sys.
      1) Scroll on the Scheme side and locate the group of records with the standard value of Safeguard Business Sys.
      2) Right-click one of the Safeguard Business Sys standard values and select Edit.
         a) Type Safeguard Business Systems in the Standard field.
         b) Click OK.
   i. Change the standard value of ?.
      1) Scroll on the Scheme side and locate the record with the standard value of ?.
      2) Right-click the ? standard value and select Edit.
         a) Type //Remove in the Standard field. This removes the value from the field where this scheme is applied.
         b) Click OK.
   j. Make the scheme case insensitive. (Schemes are case sensitive by default.)
      1) Click Options.
      2) Clear the Case sensitive check box.
      3) Click OK.
12. Save the scheme to the default QKB.
   a. Select File Save.
   b. Type Ch3D8 Company Phrase Scheme in the Name field.
   c. Click Save.
13. Select File Exit to close the Scheme Builder window.
14. If necessary, click the Report tab.
15. Click in front of dfConglomerate Grocery to view the available folders.
16. Click the MANUFACTURERS table.
17. If necessary, click the Standard Metrics tab.
18. On the Standard Metrics tab, click the MANUFACTURER field.
19. Select Actions Build a Scheme. The Report Generation window appears.
   a. Under Phrase analysis, click Build.
   b. Click in the Match definition field and select Organization (Scheme Build). Click in the Sensitivity field and select 75.
   c. Click OK to close the Report Generation window.
20. Access the scheme Ch3D8 Company Phrase Scheme.
   a. Select File Open.
   b. Click Ch3D8 Company Phrase Scheme.
   c. Click Open.
21. Select Report Compare Report to Scheme Highlight Unaccounted Permutations. Values not already in the scheme are highlighted in red.
22. Update the existing scheme for a set of values.
   a. At the bottom of the Report side, locate the Standard field.
   b. Type Arrowhead Mills, Inc in the Standard field.
   c. Locate the group of records that begin with Arrowhead Mills, Inc.
   d. Click the first of these records.
   e. Press and hold the SHIFT key and click the last of these records.
   f. Click Add to Scheme (at the bottom of the Report side).
   These five values are added to the scheme. Notice that the values are no longer highlighted in red.
23. Update the existing scheme for a set of values.
   a. At the bottom of the Report side, locate the Standard field.
   b. Type Breadshop Natural Foods in the Standard field.
   c. Locate the group of records that begin with Breadshop Natural Foods.
   d. Click the first of these records.
   e. Press and hold the SHIFT key and click the last of these records.
   f. Click Add to Scheme (at the bottom of the Report side).
24. Update the existing scheme for a set of values.
   a. Locate the group of records that begin with General Mills.
   b. Double-click the value General Mills. This action populates the Standard field with the General Mills value.
   c. Select the records in this group that are red.
   d. Click Add to Scheme (at the bottom of the Report side). A warning window appears and indicates that there is duplicate data.
   e. Click OK to close the warning window (titled Duplicate Data).
25. Update the existing scheme for a set of values.
   a. Locate the group of records that begin with Hannaford.
   b. Double-click the value Hannaford Bros.
   c. Select the records that begin with Hannaford.
   d. Click Add to Scheme (at the bottom of the Report side).
26. Select File Save.
27. Select File Exit to close the Scheme Builder window.

Creating an Element Standardization Scheme

This demonstration illustrates creating an element standardization scheme.

1. If necessary, access DataFlux Data Management Studio and the Ch3D3_dfConglomerate_Profile profile.
   a. Select Start All Programs DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.
   b. Click the Folders riser bar.
   c. Click in front of Basics Demos to view the available folders.
   d. Click in front of profiles_and_explorations to view the available items.
   e. Right-click the Ch3D3_dfConglomerate_Profile profile and select Open.
2. Click the Report tab.
3. Click in front of dfConglomerate Gifts to view the available folders.
4. Click the Customers table.
5. If necessary, click the Standard Metrics tab.
6. On the Standard Metrics tab, click the ADDRESS field.
7. Select Actions Build a Scheme. The Report Generation window appears.
   a. Click Element analysis.
   b. Keep the Chop table field value set to None. This builds a report for each individual word in the selected field (separated by a space).
   c. Click OK to close the Report Generation window.
8. Investigate the Drive values.
   a. Select Report Find in Report.
   b. Type Dr in the Find what field.
   c. Select the Match whole word only check box.
   d. Click Find Next.
   e. Click Cancel to close the Find window. The Report side of the Scheme Builder window has the first occurrence of Dr highlighted.
   f. Click the Dr value.
   g. Select Report Permutation Drill Through.
   h. Click Close to close the Permutation Drill Through window.
   i. Add Dr, Dr., Drive, and Drive, to the scheme with a standard of Drive.
      1) Type Drive in the Standard field.
      2) In the report, click the Dr value. Then hold down the CTRL key, and click the Dr., Drive, and Drive, values.
3) Click Add To Scheme. 9. Investigate and set a standard for the Avenue values. a. Select Report Find in Report. b. Type Avenue in the Find what field. c. Click the Match whole word only check box. d. Click Up. e. Click Find Next. f. Click Cancel to close the Find window. The Report side of the Scheme Builder window has the first occurrence of Avenue highlighted. 10. Add permutations of Avenue to the scheme with a standard of Avenue. a. Double-click the value Avenue. The Standard field is updated. b. In the report, click the Ave value. Then hold down the CTRL key, and click Ave. and Avenue values. c. Click Add To Scheme. The values with the specified standard are added to the Scheme side of the Scheme Builder window. 3.4 Designing Data Standardization Schemes 3-95 11. Save the scheme. a. Select File Save to save the standardization scheme in the Quality Knowledge Base (QKB). b. Type Ch3D9 Address Element Scheme in the Name field. c. Click Save. 12. Select File Exit to close the Scheme Builder window. 13. Select File Close Profile. 3-96 Chapter 3 PLAN Importing a Scheme from a Text File This demonstration illustrates importing a scheme from a text file. 1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. 2. Select Tools Other QKB Editors Scheme Builder. 3. Click the English (United States) locale and click Open. 4. Select File Import From Text File. The Import From Text File window appears. 5. Specify settings for the Import From Text File window. a. Click next to the Import file name field. 1) Navigate to S:\Workshop\dqdmp1\data\Text Files. 2) Click the Training_Demo_Company_Scheme.txt file. 3) Click Open. b. Click in the Text qualifier field and select " (double quotation mark). c. Type 1 in the Number of rows to skip field. d. Verify that Comma is selected as the value for the field delimiter. e. Click OK. 3.4 Designing Data Standardization Schemes 3-97 6. Save the scheme. a. Select File Save As to save the standardization scheme in the Quality Knowledge Base (QKB). 1) Click English (United States). 2) Click OK. b. Type Ch3D10 Company Scheme (Text File) in the Name field. c. Click Save. 7. Select File Exit to close the Scheme Builder window. 3-98 Chapter 3 PLAN Exercises 6. Creating a Scheme from the CONTACT_CNTRY Field. Open the Ch3E3_dfConglomerateGrocery_Profile profile. • Locate the CONTACT_CNTRY field in the MANUFACTURERS table. • Verify that the distinct set of values can all be represented by USA. (Be sure to investigate the X value to confirm this.) • Create a scheme that translates all the values to USA. • Save the scheme as Ch3E6 CONTACT_CNTRY Scheme. 7. Importing a Scheme from a Text File. Import a scheme from the following text file: S:\Workshop\dqdmp1\data\Text Files\Training_Exercise_Company_Scheme.txt • Use the following specifications for the text file: Text qualifier: " (double quotation mark) Number of rows to skip: 1 Comma Field Delimiter: • Save the new scheme as Ch3E7 Company Scheme (Text File). 3.5 Solutions to Exercises 3.5 Solutions to Exercises 1. Creating a Dates Collection a. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. b. Verify that the Home tab is selected. c. Click the Collections riser bar. d. Click the Basics Exercises repository. e. Click (or right-click Exercises and select New Collection). 1) Type Dates in the Name field. 2) Type Collection of Date Fields in the Description field. 3) Click OK. f. Click the newly defined Dates collection in the Navigation pane. g. 
Click Insert Fields in the information pane for the Dates collection. The Insert Fields window appears. h. Add date fields from the dfConglomerate Gifts data connection. 1) Click next to the Connection field. 2) Click dfConglomerate Gifts. 3) Click OK. 4) Add fields from the Order Details table. a) Click in front of Order Details table. b) Click the DATE ALLOCATED field. (Click the check box.) c) Click Add. d) Clear the selections for the Order Details table. 5) Add fields from the Orders table. a) Click in front of the Orders table. b) Click the ORDER DATE field. (Click the check box.) c) Click the PAID DATE field. (Click the check box.) d) Click the SHIPPED DATE field. (Click the check box.) e) Click Add. 3-99 3-100 Chapter 3 PLAN i. Add date fields from the dfConglomerate Grocery data connection. 1) Click next to the Connection field. 2) Click dfConglomerate Grocery. 3) Click OK. 4) Add fields from the BREAKFAST_ITEMS table. a) Click in front of the BREAKFAST_ITEMS table. b) Click the DATE field. (Click the check box.) c) Click Add. 5) Add fields from the MANUFACTURERS table. a) Click in front of the MANUFACTURERS table. b) Click the POSTDATE field. (Click the check box.) c) Click Add. 6) Click Close to close the Insert Fields window. The Dates collection should now resemble the following: QUESTION: How many date fields were added to the collection? Answer: Six (6) 3.5 Solutions to Exercises 2. Creating a Phone Numbers Collection a. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. b. Verify that the Home tab is selected. c. Click the Collections riser bar. d. Click the Basics Exercises repository. e. Click (or right-click Exercises and select New Collection). 1) Type Phone Numbers in the Name field. 2) Type Collection of Phone Number Fields in the Description field. 3) Click OK. f. Click the newly defined Phone Number collection in the Navigation pane. g. Click Insert Fields in the information pane for the Phone Number collection. The Insert Fields window appears. h. Add phone number fields from the dfConglomerate Gifts data connection. 1) Click next to the Connection field. 2) Click dfConglomerate Gifts. 3) Click OK. 4) Add fields from the Customers table. a) Click in front of the Customers table. b) Click the BUSINESS PHONE field. (Click the check box.) c) Click the FAX NUMBER field. (Click the check box.) d) Click the HOME PHONE field. (Click the check box.) e) Click the MOBILE PHONE field. (Click the check box.) f) Click Add. g) Clear the selections for the Customers table. 5) Add fields from the Employees table. a) Click in front of the Employees table. b) Click the BUSINESS PHONE field. (Click the check box.) c) Click the FAX NUMBER field. (Click the check box.) d) Click the HOME PHONE field. (Click the check box.) e) Click the MOBILE PHONE field. (Click the check box.) 3-101 3-102 Chapter 3 PLAN f) Click Add. g) Clear the selections for the Employees table. i. Add phone number fields from the dfConglomerate Grocery data connection. next to the Connection field. 1) Click 2) Click dfConglomerate Grocery. 3) Click OK. 4) Add fields from the MANUFACTURERS table. a) Click in front of the MANUFACTURERS table. b) Click the PHONE field. c) Click the CONTACT_PHONE field. (Click the check box.) d) Click Add. 5) Click Close to close the Insert Fields window. The Phone Numbers collection should now resemble the following: QUESTION: How many phone number fields were added to the collection? Answer: Ten (10) 3. Profiling the Tables in dfConglomerate Grocery a. 
If necessary, select Start All Programs DataFlux Data Management Studio 2.2. b. Click the Folders riser bar. c. Click the Basics Exercises repository. 3.5 Solutions to Exercises d. Click 3-103 Profile. 1) Double-click the profiles_and_explorations folder. (This action makes this folder the value of the Save in field.) 2) Type Ch3E3_dfConglomerateGrocery_Profile in the Name field. 3) Click OK. e. Verify that the Properties tab is selected. f. Click the dfConglomerate Grocery check box. An X appears in the check box. g. Define profile options. 1) Select Tools Default Profile Metrics. 2) Verify that all metrics are selected. 3) Click OK to close the Metrics window. h. Select File Save Profile to save the profile to this point. i. Override metrics for two ID columns. 1) Click in front of dfConglomerate Grocery. 2) Click the MANUFACTURERS table. 3) Click the ID field. 4) Right-click the ID column and select Override Metrics. a) Clear the selection for Frequency distribution. b) Clear the selection for Pattern frequency distribution. c) Clear the selection for Percentiles. d) Clear the selection for Outliers. e) Click OK to close the Metrics window. 5) Click the BREAKFAST_ITEMS table. 6) Click the ID field. 7) Right-click the ID column and select Override Metrics. a) Clear the selection for Frequency distribution. b) Clear the selection for Pattern frequency distribution. c) Clear the selection for Percentiles. d) Clear the selection for Outliers. e) Click OK to close the Metrics window. 3-104 Chapter 3 PLAN j. Select File Save Profile to save the profile. k. Select Actions Run Profile Report. 1) Type Initial profile in the Description field. 2) Click OK to close the Run Profile window. The Report tab becomes active. l. Review the Profile report. QUESTION: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS table? Answer: to expand the BREAKFAST_ITEMS table. Click the UOM field. Click The Column Profiling tab shows the Unique Count metric to have a value of 6 (six). QUESTION: What are the distinct values for the UOM field in the BREAKFAST_ITEMS table? Answer: If necessary, click to expand the BREAKFAST_ITEMS table, and then click the UOM field. Click the Frequency Distribution tab to display the six unique values (OZ, CT, LB, <null value>, PK, 0Z). QUESTION: What is the ID field value for the records with PK as a value for the UOM field in the BREAKFAST_ITEMS table? Answer: If necessary, click to expand the BREAKFAST_ITEMS table, and then click the UOM field. Click the Frequency Distribution tab to display the six unique values. Double-click the value PK. Notice that the ID values for these two records are 556 and 859. m. Filter for values of UOM with a frequency count greater than 5. 1) Verify that the Frequency Distribution tab is selected (for the selected UOM field). 2) Select View Report Filter. 3) Under Filter condition, select Frequency count as the value for the Field type field. 4) Under Filter condition, select greater than as the value for the Field type field. 5) Type 5 (five) in the Single value field. 3.5 Solutions to Exercises 3-105 6) Click Add Condition. 7) Click OK to close the Report Filter window. The Frequency Distribution tab is updated. n. Add a note for the UOM field. 1) Verify that the UOM field is selected. 2) Select Insert New Note. 3) Type Standardize the UOM field in the Add Note window. 4) Click OK to close the Add Note window. 5) Verify that the new note appears on the Notes tab. o. 
View the pattern frequency distribution for the PHONE field in the MANUFACTURERS table. 1) Click in front of the MANUFACTURERS table. 2) Click the PHONE field. 3) Click the Pattern Frequency Distribution tab. 4) Double-click the pattern (999)999-9999. The Pattern Frequency Distribution Drill Through window appears. 5) Click Close to close the Pattern Frequency Distribution Drill Through window. 3-106 Chapter 3 PLAN p. Add a note for the PHONE field. 1) Verify that the PHONE field is selected. 2) Select Insert New Note. 3) Type Standardize the PHONE field in the Add Note window. 4) Click OK to close the Add Note window. 5) Verify that the new note appears on the Notes tab. q. Investigate data via a visualization. 1) Click the MANUFACTURERS table. 2) Click the Visualizations tab. 3) Click to the right of the Chart field. 4) Type Comparing Min and Max Lengths for all fields in the Description field. 5) Select Line as the value for the Chart type field. 6) Verify that the data type is set to Field metrics. 7) Click Select/unselect all under the Fields area. (This selects all fields.) 8) Click Minimum Length and Maximum Length as the choices under the Metrics area. 9) Click OK to close the Chart Properties window. 3.5 Solutions to Exercises The Chart window appears. r. Select File Close Profile. 3-107 3-108 Chapter 3 PLAN 4. Profiling a Selection of Tables in a DataFlux Sample a. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. b. Click the Folders riser bar. c. Click the Basics Exercises repository. d. Click Profile. 1) Double-click the profiles_and_explorations folder. (This action makes this folder the value of the Save in field.) 2) Type Ch3E4_DataFluxSample_Profile in the Name field. 3) Click OK. e. Verify that the Properties tab is selected. 1) Click in front of DataFlux Sample. 2) Click the check box in front of the Client_Info table. An X appears in the check box. 3) Click the check box in front of the Contacts table. An X appears in the check box. 4) Click the check box in front of the Product table. An X appears in the check box. 5) Click the check box in front of the Product_Sales table. An X appears in the check box. f. Define profile options. 1) Select Tools Default Profile Metrics. 2) If necessary, click Select/unselect all in the Column profiling area. 3) Verify that the top four metrics (Frequency distribution, Pattern frequency distribution, Percentiles, and Outliers) are not selected. 4) Click OK to close the Metrics window. g. Select File Save Profile to save the profile to this point. h. Override metrics for the Phone and State fields of the Client_Info and Contacts tables. 1) Click Client_Info table. 2) Click the Phone field. 3) Hold down the CTRL key and then click State. 4) Right-click the Phone column and select Override Metrics. a) Click Frequency distribution. b) Click Pattern frequency distribution. c) Click OK to close the Metrics window. 5) Click the Contacts table. 3.5 Solutions to Exercises 3-109 6) Click the PHONE field. 7) Hold down the CTRL key and then click STATE. 8) Right-click the Phone column and select Override Metrics. a) Click Frequency distribution. b) Click Pattern frequency distribution. c) Click OK to close the Metrics window. i. Override metrics for the WEIGHT_OZS field of the Product table. 1) Click Product table. 2) Right-click the WEIGHT_OZS column and select Override Metrics. a) Click Frequency distribution. b) Click Pattern frequency distribution. c) Click Percentiles. d) Click Outliers. 
j. Select File ⇨ Save Profile to save the profile.
k. Select Actions ⇨ Run Profile Report.
1) Type Initial profile in the Description field.
2) Click OK to close the Run Profile window. The Report tab becomes active.
l. Review the profile report.
QUESTION: How many distinct values exist for the STATE field in the Contacts table?
Answer: (i) Click the expand icon to expand the Contacts table. (ii) Click the STATE field. The Column Profiling tab shows the Unique Count metric to have a value of 61.
QUESTION: How many distinct patterns exist for the STATE field in the Contacts table?
Answer: (i) Click the expand icon to expand the Contacts table. (ii) Click the STATE field. The Column Profiling tab shows the Pattern Count metric to have a value of 21.
QUESTION: What does a bar chart graphic (visualization) tell you about the comparison of the data length and maximum length metrics for all fields? (Do not include the fields DATE, DELETE_FLG, ID, and MATCH_CD in the graphic view.)
Answer: (i) Click the Contacts table. (ii) Click the Visualizations tab. (iii) Click the New Chart button to the right of the Chart field. (iv) Type Comparing Data and Max Lengths for all fields in the Description field. (v) Select Bar as the value for the Chart type field. (vi) Verify that the data type is set to Field metrics. (vii) Click Select/unselect all in the Fields area. (This selects all fields.) (viii) Clear the selections for the DATE, DELETE_FLG, ID, and MATCH_CD fields. (ix) Click Data Length and Maximum Length as the choices in the Metrics area. (x) Click OK to close the Chart Properties window. The Visualizations tab shows that one field (ADDRESS) has a defined length much longer than the maximum length, and that several additional fields have defined data lengths longer than the maximum length actually used.
QUESTION: What is the ID field value for the record(s) with a "bad" PHONE field pattern in the CONTACTS table?
Answer: (i) Click the expand icon to expand the Contacts table. (ii) Click the PHONE field. (iii) Click the Pattern Frequency Distribution tab. (iv) Double-click the pattern (999)999-9999. (v) Click Close. (vi) Select Insert ⇨ New Note. (vii) Type Be sure to correct the PHONE field pattern. (viii) Click OK.
m. Add a table-level note.
1) Click the Contacts table.
2) Select Insert ⇨ New Note.
3) Type Be sure to standardize the STATE and PHONE fields.
4) Click OK.
n. Select File ⇨ Close Profile.

5. Profiling a Text File
a. If necessary, select Start ⇨ All Programs ⇨ DataFlux Data Management Studio 2.2.
b. Click the Folders riser bar.
c. Click the Basics Exercises repository.
d. Click Profile.
1) Double-click the profiles_and_explorations folder. (This action makes this folder the value of the Save in field.)
2) Type Ch3E5_Manufacturer_TextFile_Profile in the Name field.
3) Click OK.
e. On the Properties tab, click Text Files.
f. Select Insert ⇨ New Text File.
1) Type Manufacturer_Contacts in the Table name field.
2) Click OK to close the Insert Text File window.
g. Specify delimited file information.
1) Click the ellipsis button next to the Input file name field. The Open window appears.
2) Navigate to S:\Workshop\dqdmp1\data\Text Files.
3) Click the Manufacturer_Contact_List.txt file.
4) Click Open.
5) Click in the Text qualifier field and select " (double quotation mark).
6) Verify that the default delimiter is set to Comma.
7) Click Number of rows to skip and verify that the default value is set to one.
8) Click Suggest in the Fields area.
a) Click First row contains field names.
b) Click Guess field types and lengths from the file content.
c) Click OK.
9) Click Export in the Fields area.
a) Navigate to S:\Workshop\dqdmp1\data\Text Files.
b) Type Manufacturer_Contact_List.dfl in the File name field.
c) Click Save. (Replace the file if necessary.)
10) Click OK to close the Delimited File Information window. The field information is returned to the profile. The fields here can be "processed" as you did for a table source.
h. If necessary, select all fields by clicking the check box in front of Manufacturer_Contacts.
i. Select all metrics.
1) Select Tools ⇨ Default Profile Metrics.
2) Verify that all metrics are selected.
3) Click OK to close the Metrics window.
j. Select File ⇨ Save Profile to save the profile.
k. Select Actions ⇨ Run Profile Report.
1) Type Profiling a text file in the Description field.
2) Click OK to close the Run Profile window. The profile executes. The status of the execution is displayed. The Report tab becomes active.
QUESTION: How many patterns exist for the WORK_STATE field?
1) If necessary, click the expand icon in front of Manufacturer_Contacts to view the available fields.
2) Click the WORK_STATE field.
Answer: Notice the Pattern Count metric (1).
QUESTION: How many values exist for the WORK_STATE field?
Answer: Notice the Unique Count metric (12).
QUESTION: What values exist for the WORK_STATE field?
Answer: Click the Frequency Distribution tab to display the values.
QUESTION: What do the records from the WORK_STATE of CA look like?
Answer: Double-click the value CA.
QUESTION: What can be said about the comparison of the Pattern Count and Unique Count metrics for all fields?
1) Click the text file Manufacturer_Contacts.
2) Click the Visualizations tab.
3) Click the New Chart button to the right of the Chart field.
a) Type Comparing Pattern and Unique Count metrics for all fields in the Description field.
b) Select Bar as the value for the Chart type field.
c) Verify that the Data type is set to Field metrics.
d) Click Select/unselect all in the Fields area.
e) Click Pattern Count and Unique Count in the Metrics area.
f) Click OK.
l. Select File ⇨ Close Profile.

6. Creating a Scheme from the CONTACT_CNTRY Field
a. If necessary, select Start ⇨ All Programs ⇨ DataFlux Data Management Studio 2.2.
b. Click the Folders riser bar.
c. Navigate to Basics Exercises ⇨ profiles_and_explorations.
d. Double-click Ch3E3_dfConglomerateGrocery_Profile. The profile opens on a new tab.
e. If necessary, click the expand icon in front of MANUFACTURERS to view the available fields.
f. Click the CONTACT_CNTRY field.
g. Click the Frequency Distribution tab.
h. Double-click the X value.
i. Verify that the contact information should have a CONTACT_CNTRY value of USA.
j. Click Close.
k. From the Tables riser bar, click the MANUFACTURERS table.
l. Right-click the CONTACT_CNTRY field and select Build a Scheme.
m. Accept the defaults in the Report Generation window and click OK.
n. Double-click the value USA.
o. Click all three values in the report.
p. Click Add To Scheme.
q. Select File ⇨ Save As.
1) Type Ch3E6 CONTACT_CNTRY Scheme in the Name field.
2) Click Save.
r. Select File ⇨ Exit to close the Scheme Builder window.
s. Select File ⇨ Close Profile to close the profile tab.

7. Importing a Scheme from a Text File
a. If necessary, select Start ⇨ All Programs ⇨ DataFlux Data Management Studio 2.2.
b. Select Tools ⇨ Other QKB Editors ⇨ Scheme Builder.
c. Click the English (United States) locale and click Open.
d. Select File ⇨ Import From Text File. The Import From Text File window appears.
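A scheme import file of this kind is a delimited text file of data/standard pairs, matching the settings configured in the next step (double-quotation-mark text qualifier, comma delimiter, one header row to skip). Assuming that two-column layout (an assumption made for illustration; the actual columns of the course file are not shown here), the import amounts to something like this Python sketch:

    import csv

    # Hypothetical local copy; the course file lives under
    # S:\Workshop\dqdmp1\data\Text Files.
    with open("Training_Exercise_Company_Scheme.txt", newline="") as fh:
        reader = csv.reader(fh, delimiter=",", quotechar='"')
        next(reader)  # skip the single header row
        # assumed layout: one data value and one standard value per row
        scheme = {data: standard for data, standard in reader}

    # each imported pair becomes a find-and-replace entry
    print(len(scheme), "scheme entries loaded")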
e. Specify the settings for the Import From Text File window.
1) Click the ellipsis button next to the Import file name field.
a) Navigate to S:\Workshop\dqdmp1\data\Text Files.
b) Click the Training_Exercise_Company_Scheme.txt file.
c) Click Open.
2) Click in the Text qualifier field and select " (double quotation mark).
3) Type 1 in the Number of rows to skip field.
4) Verify that Comma is selected as the value for the Field Delimiter.
5) Click OK.
f. Save the scheme.
1) Select File ⇨ Save As to save the standardization scheme in the QKB.
a) Click English (United States).
b) Click OK.
2) Type Ch3E7 Company Scheme (Text File) in the Name field.
3) Click Save.
g. Select File ⇨ Exit to close the Scheme Builder window.

Chapter 4 ACT
4.1 Introduction to Data Jobs .................................................................................... 4-3
Demonstration: Setting DataFlux Data Management Studio Options ......................... 4-5
Demonstration: Creating and Running a Simple Data Job ........................................ 4-10
Exercises .................................................................................................................... 4-33
4.2 Data Quality Jobs ............................................................................................... 4-34
Demonstration: Investigating Standardizing, Parsing, and Casing ............................ 4-38
Demonstration: Investigating Right Fielding and Identification Analysis ................. 4-50
Exercises .................................................................................................................... 4-61
Demonstration: Using a Standardization Definition and a Standardization Scheme .. 4-63
4.3 Data Enrichment Jobs (Self-Study) ................................................................... 4-75
Demonstration: Performing Address Verification and Geocoding ............................. 4-77
Exercises .................................................................................................................... 4-88
4.4 Entity Resolution Jobs ....................................................................................... 4-90
Demonstration: Creating a Data Job to Cluster Records ........................................... 4-93
Exercises .................................................................................................................. 4-108
Demonstration: Creating an Entity Resolution Job ................................................. 4-112
Demonstration: Creating a Data Job to Compare Clusters (Optional) ..................... 4-138
4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) .......................................... 4-147
Demonstration: Multi-Input/Multi-Output Data Job: New Products ....................... 4-149
Demonstration: Multi-Input/Multi-Output Data Job: Customer Matches ................ 4-159
Exercises .................................................................................................................. 4-166

4.1 Introduction to Data Jobs

Objectives
• Discuss and explore the basic DataFlux Data Management Studio options for affecting data jobs.
• Create and execute a simple data job.

What Is a Data Job?
Data jobs are the main way to process data in DataFlux Data Management Studio. Each data job specifies a set of data-processing operations that flow from source to target.

Data Job Options
This dialog box enables you to set global options for data jobs as well as other areas of DataFlux Data Management Studio.

Setting DataFlux Data Management Studio Options
In this demonstration, you investigate and set various DataFlux Data Management Studio options.
1. If necessary, select Start ⇨ All Programs ⇨ DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Select Tools ⇨ Data Management Studio Options. The Data Management Studio Options window appears.
4. Verify that the General item is selected in the selection pane.
5. Clear the Automatically preview a selected node when the preview tab is active check box.
6. Click Job in the selection pane.
7. Click the Include node-specific notes when printing check box.
8. Click All in the Output Fields pane.
9. Click QKB in the selection pane.
10. If necessary, click in the Default Locale field and select English (United States).
11. Click OK to save the change and close the Data Management Studio Options window.

Data Inputs and Data Outputs
Data jobs always contain a data input node and a data output node. Use the nodes in the Data Inputs section to specify different types of input to a data job. Use the nodes in the Data Outputs section to specify different types of output from a data job. Data jobs can have more than one input node and output node.

Data Validation Node
The Data Validation node is used to filter or flag rows according to the specified condition(s). The Data Validation node is in the Utilities grouping of nodes.

Run and Preview Tools
The Run tool submits the job contained in the Data Flow Editor for processing and creates the output(s) specified in the job. The Preview tool initiates a preview run and displays the results at the point of the selected node. A preview of a Data Output node does not show field name changes or deletions. This enables the flexibility to continue your data flow after a Data Output node. In addition, previewing a Data Output node does not create the output. You must run the data job to create the output.

Viewing Data Job Logs
Review the log after running a data job. The Log tab displays the status of the last job run.

Creating and Running a Simple Data Job
In this demonstration, you create a data job that reads records from the Products table (in the dfConglomerate Gifts data source). The data job filters the data, and then writes the filtered records to a text file.
1. If necessary, select Start ⇨ All Programs ⇨ DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D2_Products_Misc in the Name field.
c. Click OK. The new data job appears on a tab.
6. Add the Data Source node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. Click the expand icon in front of the Data Inputs grouping of nodes.
c. Double-click the Data Source node. An instance of the node appears in the data flow. A properties window for the node appears. You can access the properties window for any node by right-clicking the node and selecting Properties.
d. Type Products Table in the Name field.
e. Click the ellipsis button next to the Input table field.
1) Click the expand icon in front of the dfConglomerate Gifts data source.
2) Click Products.
3) Click OK to close the Select Table window.
All the fields from the table are automatically selected because the Output Fields option was set to All in the Data Management Studio Options window. Fields can be deselected or selected for output from the node by using the right and left arrow buttons:
• The single right arrow selects the highlighted field(s) for output from the node.
• The double right arrow selects all fields for output from the node.
• The single left arrow removes the highlighted field(s) from the output of the node.
• The double left arrow removes all fields from the output of the node.
The output names for the fields can be renamed by double-clicking the field name and changing the name. The order of the output fields from the node can be controlled by highlighting field(s) and using the up and down arrow buttons:
• The up arrow moves the highlighted field(s) up in the output list.
• The down arrow moves the highlighted field(s) down in the output list.
f. Move the CATEGORY field so that it follows the PRODUCT NAME field.
1) Scroll in the Selected list box of Output fields and locate the CATEGORY field.
2) Click the CATEGORY field.
3) Click the up arrow seven (7) times so that it appears after PRODUCT NAME.
g. Click OK to save the changes and close the Data Source Properties window. The Data Source node appears in the job diagram and the updated information is displayed.
7. Edit the Basic Settings for the Data Source node, and then preview the data.
a. If necessary, click View ⇨ Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Data Source as the value for the Description field.
d. Click the Preview tab.
e. Right-click the Data Source node and select Preview. A sample of records appears on the Preview tab of the Details pane. You must explicitly select Preview because you turned off the Auto Preview option.
8. Add a Data Validation node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Utilities grouping of nodes.
c. Locate the Data Validation node.
d. Double-click the Data Validation node. An instance of the node appears in the data flow. The Data Validation Properties window appears.
9. Specify properties for the Data Validation node.
a. Type Filter for Miscellaneous in the Name field.
b. Click the down arrow next to Field and select CATEGORY.
c. Click in the Operation field and select equal to.
d. Click Single (if necessary) and then type Miscellaneous.
e. Click Add Condition. The Expression area is updated.
f. In the Audit action area, verify that Fails is selected for Row validation and Remove row from output is selected as the action.
The final settings for the Data Validation node should resemble the following:
g. Click OK to close the Data Validation Properties window.
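Conceptually, the Data Validation settings above behave like the following Python sketch (illustrative only, not DataFlux code; the sample rows are invented): rows that fail the CATEGORY equal to Miscellaneous condition are removed from the output.

    def validate(rows, field="CATEGORY", value="Miscellaneous"):
        """Mimic the settings above: a row 'fails' when the condition
        is not met, and failing rows are removed from the output."""
        return [row for row in rows if row.get(field) == value]

    rows = [
        {"PRODUCT NAME": "Gift Box", "CATEGORY": "Miscellaneous"},
        {"PRODUCT NAME": "Teddy Bear", "CATEGORY": "Toys"},
    ]
    print(validate(rows))  # only the Miscellaneous row survives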
10. Establish an appropriate description and preview the Data Validation node.
a. If necessary, click the Data Validation node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Subset for Misc as the value for the Description field.
e. Click the Data Validation node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details pane.
g. Verify that the CATEGORY field values are all Miscellaneous.
11. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Utilities grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Locate the Text File Output node.
d. Double-click the Text File Output node. An instance of the node appears in the data flow. The Text File Output Properties window appears.
12. Specify properties for the Text File Output node.
a. Type Miscellaneous Products in the Name field.
b. Specify the output file information.
1) Click the ellipsis button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D2_Products_Misc.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to " (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Specify only the desired fields of PRODUCT CODE and PRODUCT NAME.
1) Remove all default Selected fields by clicking the double left arrow.
2) Double-click PRODUCT CODE in the Available list box.
3) Double-click PRODUCT NAME in the Available list box.
e. Rename PRODUCT CODE for the output file.
1) Double-click PRODUCT CODE in the Output Name field in the Selected list box.
2) Type PRODUCT CD.
3) Press ENTER.
f. Add a note that states that a field name was changed.
1) Click Notes.
2) Type The PRODUCT CODE field was renamed to PRODUCT CD in the output file. in the Notes window.
3) Click OK to close the Notes window.
The final settings for the Text File Output node should resemble the following:
g. Click OK to save the changes and close the Text File Output Properties window.
13. Establish an appropriate description and preview the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Text File as the value for the Description field.
e. Verify that the note entered for the node appears.
f. Click the Text File Output node in the data flow diagram.
g. Click the Preview tool. A sample of records appears on the Preview tab of the Details pane.
h. Verify that the CATEGORY field values are all Miscellaneous.
When you preview an output node, all fields are shown and any renames of output fields are not reflected in the preview. This is by design to enable you to write only certain records to an output and still have all the fields for the continuation of your data flow. Previewing does not create the output. Only when the job is run is the output physically created.
The data job should now resemble the following:
14. Verify that the added note is reflected on the Text File Output node.
a. Locate the Notes icon in the Text File Output node.
b. Click the note icon on the Text File Output node.
c. Click the close button to close the notes item.
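The Text File Output settings above (comma delimiter, double-quotation-mark text qualifier, header row, two selected fields with one rename) correspond roughly to this Python sketch (illustrative only; the sample row and local file path are invented; the demo writes to S:\Workshop\dqdmp1\Demos\files\output_files):

    import csv

    rows = [{"PRODUCT CODE": "GB-100", "PRODUCT NAME": "Gift Box"}]

    with open("Ch4D2_Products_Misc.txt", "w", newline="") as fh:
        writer = csv.writer(fh, quotechar='"', quoting=csv.QUOTE_ALL)
        writer.writerow(["PRODUCT CD", "PRODUCT NAME"])  # header row, with the rename
        for row in rows:
            writer.writerow([row["PRODUCT CODE"], row["PRODUCT NAME"]])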
d. Add a Data Target (Insert) node to the job flow.
1) In the Nodes resource pane, verify that the Data Outputs grouping of nodes is expanded.
2) Double-click the Data Target (Insert) node. The node appears in the data flow. The Data Target (Insert) Properties window appears.
e. Specify properties for the Data Target (Insert) node.
1) Type Insert Records on Products_Misc Table in the Name field.
2) Click the ellipsis button next to the Output table field. The Select Table window appears.
a) Click the expand icon in front of the dfConglomerate Gifts data source.
b) Click dfConglomerate Gifts.
c) Click the New Table button to be able to add a table to the selected data connection.
d) Type Products_Misc in the Enter a name for the new table field.
e) Click OK to close the New Table window.
f) Click OK to close the Select Table window.
3) Click the Delete existing rows check box.
4) Accept the default selected output fields.
5) Click OK to save the changes and close the Data Target (Insert) Properties window.
15. Establish an appropriate description for the Data Target (Insert) node.
a. If necessary, click the Data Target (Insert) node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Table as the value for the Description field.
16. Add a note to the data flow.
a. Click the Insert Note tool.
b. Click and drag the note to position it under the nodes.
c. Type the following as text for the note:
This job:
1. Reads the records from the Products table
2. Filters for the records where CATEGORY=Miscellaneous
3. Writes the results of two fields (one renamed) to a text file
4. Writes the results of all fields to a new table in dfConglomerate Gifts
The data job should now resemble the following:
17. Group the two output nodes.
a. Click the Text File Output node (labeled Miscellaneous Products).
b. Hold down the CTRL key and click the Data Target (Insert) node (labeled Insert Records on Products_Misc Table).
c. Right-click one of the selected nodes and select Group Selected Nodes.
1) Type Output Nodes in the Name field.
2) Type Grouping the text file and data target nodes together in the Description field.
3) Click OK to close the Group Properties window.
The data job should now resemble the following:
To undo the grouping, you can right-click the grouping element and select Ungroup Selected Groups.
18. Specify properties for the job.
a. Click the Settings tab.
b. Type Create and run a data job in the Description field.
c. Type The job writes Products with the CATEGORY of Miscellaneous to a text file (just two columns), and writes all columns (same subset of data) to a new table. in the Notes field.
19. Select File ⇨ Save.
20. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions ⇨ Run Data Job. The job runs and the requested text file output is displayed in a Notepad window.
c. Select File ⇨ Exit to close the Notepad window.
d. Notice the processing information on each node.
21. View the (full) log information.
a. Click the Log tab.
b. Review the information for each of the nodes.
22. Investigate print options.
a. If necessary, click the Data Flow tab.
b. Right-click in the background of the data flow (not on a particular node) and select Print ⇨ Print Setup.
1) Click Landscape in the Orientation area.
2) Click OK to close the Page Setup window.
The tools on the Data Flow toolbar can be used as an alternative to right-clicking for some actions.
c. Right-click in the background of the data flow (not on a particular node) and select Print ⇨ Print Preview. The Print Preview window appears.
d. Scroll to the last page and verify that the node-specific note appears. Recall the option that was set to enable the printing of the node-specific notes.
e. Click the Close button to close the Print Preview window.
f. Right-click in the background of the data flow (not on a particular node) and select Clear Run Results ⇨ Clear All Run Results.
g. Right-click in the background of the data flow (not on a particular node) and select Save Diagram As Image ⇨ Entire Diagram.
h. Navigate to S:\Workshop\dqdmp1\demos\files\output_files and click Save. (Accept the default name of the .png file.)
i. View the newly created .png file.
1) Access a Windows Explorer window.
2) Navigate to S:\Workshop\dqdmp1\demos\files\output_files.
3) Double-click the .png file. The node-specific notes are not "printed" when the diagram is saved as an image.
4) Select File ⇨ Exit to close the Windows Photo Viewer.
23. Select File ⇨ Close to close the data job.
24. Verify the location of the data job.
a. If necessary, click the Home tab.
b. If necessary, click the Folders riser bar.
c. Click the expand icon to expand the Basics Demos repository.
d. Click the batch_jobs folder.
e. Verify that the new data job (Ch4D2_Products_Misc) exists in this location.

Exercises
1. Creating a Simple Data Job
Create a data job that uses Data Source, Data Validation, and Text File Output nodes. The final job flow should resemble the following:
• Create the data job in the Basics Exercises batch_jobs folder.
• Name the data job Ch4E1_Breakfast_Items_OZ.
• Read from the BREAKFAST_ITEMS table in the dfConglomerate Grocery data connection.
• Filter for records where UOM is equal to OZ.
• Write the results to a text file named Ch4E1_Breakfast_Items_OZ.txt in S:\Workshop\dqdmp1\Exercises\files\output_files. Set up the file so that the data is comma-delimited and includes a header row. Also, display the file after the job is processed.
• Move the MANUFACTURER_ID field so that it follows the ID field.
• Add a note that explains the three nodes used in this job.
• Add an appropriate description and note to the properties/settings of the job.
• Save and run the data job.
QUESTION: How many records were selected in the filter and therefore written to the text file?
Answer:
QUESTION: How many records were read from the source table?
Answer:

4.2 Data Quality Jobs

Objectives
• Explore applying standardization schemes versus standardization definitions.
• Explore parsing.
• Explore casing.
• Explore identification analysis versus right fielding.

Standardization Scheme
A scheme is a simple find-and-replace action based on the information in the scheme file.
Partial scheme (Data → Standard):
Street → St
St. → St
ST. → St
Rd. → Rd
Road → Rd
RD. → Rd
Original data → standardized data:
123 North Main Street, Suite 100 → 123 North Main St, Suite 100
4591 S. Covington Road → 4591 S. Covington Rd
29 WASSAU ST. → 29 Wassau St
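In code terms, applying a scheme is a lookup-table replacement. A minimal Python sketch of element-level scheme application (illustrative only; real scheme application in the QKB also handles casing and punctuation):

    # the partial scheme from the slide, expressed as a lookup table
    scheme = {"Street": "St", "St.": "St", "ST.": "St",
              "Rd.": "Rd", "Road": "Rd", "RD.": "Rd"}

    def apply_scheme(value: str) -> str:
        """Element-level find-and-replace: standardize each word
        that appears in the scheme, and leave the rest alone."""
        return " ".join(scheme.get(word, word) for word in value.split())

    print(apply_scheme("4591 S. Covington Road"))  # 4591 S. Covington Rd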
Standardization Definition
A standardization definition is more complex than a standardization scheme: it can involve standardization scheme(s), and it can also parse data and apply regular expression libraries and casing information.
Data prior to standardization (definition) → data after standardization:
Mister John Q. Smith, Junior (Name) → Mr John Q Smith, Jr
dataflux corporation (Organization) → DataFlux Corp
123 North Main Street, Suite 100 (Address) → 123 N Main St, Ste 100
U.S. (Country) → UNITED STATES
9194473000 (Phone) → (919) 447 3000
If you standardize a data value using both a definition and a scheme, the definition is applied first and then the scheme is applied. Data standardization does not perform a validation of the data (for example, Address Verification). Address verification is a separate component of the DataFlux Data Management Studio application. (It is discussed in another chapter.)

Parsing Example
The address information 123 N MAIN ST APT 201 parses into the following tokens:
Street Number: 123
Pre-direction: N
Street Name: MAIN
Street Type: ST
Post-direction: (none)
Address Extension: APT
Address Extension Number: 201
Parsing separates multi-part field values into multiple, single-part fields. For example, if you have a Name field that includes the value Mr. John Q. Smith III, Esq., you can use parsing to create six separate fields:
Name Prefix: Mr.
Given Name: John
Middle Name: Q.
Family Name: Smith
Name Suffix: III
Name Appendage: Esq.

Casing Example
For the input data Dataflux corporation:
Upper → DATAFLUX CORPORATION
Lower → dataflux corporation
Proper → Dataflux Corporation
Proper(Organization) → DataFlux Corporation
For best results, if available, select an applicable definition when you use Proper casing.

Identification Analysis versus Right Fielding
Identification analysis takes a Customer field (DataFlux, John Q Smith, DataFlux Corp, Nancy Jones, Bob Brown & Sons) and adds an Identification field (Organization, Individual, Organization, Individual, Organization). Right fielding takes the same Customer field and moves each value into either a Name field (John Q Smith, Nancy Jones) or a Company field (DataFlux, DataFlux Corp, Bob Brown & Sons).
Identification analysis and right fielding use the same listing of definitions from the QKB, but in different ways. Identification analysis identifies the type of data in a field and right fielding moves the data into fields based on its identification.
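The contrast can be summarized in a few lines of Python (illustrative only; real identification uses QKB definitions, not the toy keyword rules assumed here): identification analysis adds an identity field, while right fielding moves the value into a field chosen by that identity.

    ORG_HINTS = ("corp", "inc", "sons", "dataflux")  # toy identification rules

    def identify(value: str) -> str:
        """Stand-in for an Individual/Organization identification definition."""
        return "ORGANIZATION" if any(h in value.lower() for h in ORG_HINTS) else "INDIVIDUAL"

    def identification_analysis(row):
        # adds a new field holding the identity
        row["Contact_Identity"] = identify(row["Contact"])
        return row

    def right_fielding(row):
        # moves the value into Company or Name based on its identity
        target = "Company" if identify(row["Contact"]) == "ORGANIZATION" else "Name"
        row[target] = row.pop("Contact")
        return row

    print(identification_analysis({"Contact": "John Q Smith"}))
    # {'Contact': 'John Q Smith', 'Contact_Identity': 'INDIVIDUAL'}
    print(right_fielding({"Contact": "Bob Brown & Sons"}))
    # {'Company': 'Bob Brown & Sons'}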
Investigating Standardizing, Parsing, and Casing
In this demonstration, you create a data job that standardizes selected fields from the Customers table. You parse and case the e-mail information and write the results to a text file.
1. If necessary, select Start ⇨ All Programs ⇨ DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D3_Customers_DataQuality in the Name field.
c. Click OK. The new data job appears on a tab.
6. Add the Data Source node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. Click the expand icon in front of the Data Inputs grouping of nodes.
c. Double-click the Data Source node. An instance of the node appears in the data flow, and the properties window for the node appears.
d. Type Customers Table in the Name field.
e. Click the ellipsis button next to the Input table field.
1) Click the expand icon in front of the dfConglomerate Gifts data source.
2) Click Customers.
3) Click OK to close the Select Table window.
The final settings for the Data Source Properties should resemble the following:
f. Click OK to save changes and close the Data Source Properties window.
7. Establish an appropriate description and preview the Data Source node.
a. If necessary, click the Data Source node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Data Source as the value for the Description field.
e. Click the Data Source node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details pane.
8. Add a Standardization node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Quality grouping of nodes.
c. Locate the Standardization node.
d. Double-click the Standardization node. The node appears in the data flow. The Standardization Properties window appears.
9. Specify properties for the Standardization node.
a. Type Standardize Fields in the Name field.
b. Double-click each of the following fields to move them from the Available list box to the Selected list box: COMPANY, BUSINESS PHONE, HOME PHONE, MOBILE PHONE, FAX NUMBER, ADDRESS, STATE/PROVINCE, ZIP/POSTAL CODE, COUNTRY/REGION.
c. For the COMPANY field, click the down arrow under Scheme and select Ch3D8 Company Phrase Scheme.
d. For the BUSINESS PHONE field, click the down arrow under Definition and select Phone.
e. For the HOME PHONE field, click the down arrow under Definition and select Phone.
f. For the MOBILE PHONE field, click the down arrow under Definition and select Phone.
g. For the FAX NUMBER field, click the down arrow under Definition and select Phone.
h. For the ADDRESS field, click the down arrow under Definition and select Address.
i. For the STATE/PROVINCE field, click the down arrow under Definition and select State/Province(Abbreviation).
j. For the ZIP/POSTAL CODE field, click the down arrow under Definition and select Postal Code.
k. For the COUNTRY/REGION field, click the down arrow under Definition and select Country.
l. Click the Preserve null values check box.
m. Click the Add standardization flag field check box.
The Standardization Properties window should resemble the following:
The standardization for a field can use a standardization definition and/or a scheme. If both are specified, the definition is applied first. Then the scheme is applied on the results from the definition. Selecting the Preserve null values option ensures that if a field is null when it enters the node, then the field is null after the node. Selecting this option if the output will be written to a database table is recommended.
n. Click OK to close the Standardization Properties window.
10. Establish an appropriate description and preview the Standardization node.
a. If necessary, click the Standardization node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Standardization as the value for the Description field.
e. Click the Standardization node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details pane.
g. Scroll to the right to view the _Stnd and _Stnd_flag fields.
11. Add a Parsing node to the job flow.
a. In the Nodes resource pane, verify that the Quality grouping of nodes is expanded.
b. Double-click the Parsing node. The node appears in the data flow. The Parsing Properties window appears.
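Before configuring the node, it may help to see what tokenizing an e-mail address amounts to. A rough Python sketch of the three tokens selected in the next step (illustrative only; the real work is done by the QKB E-mail parse definition, and the sample address is invented):

    def parse_email(address: str) -> dict:
        """Split an address into the Mailbox, Sub-Domain,
        and Top-Level Domain tokens."""
        mailbox, _, domain = address.partition("@")
        sub_domain, _, top_level = domain.rpartition(".")
        return {"Mailbox": mailbox,
                "Sub-Domain": sub_domain,
                "Top-Level Domain": top_level}

    print(parse_email("jsmith@dataflux.com"))
    # {'Mailbox': 'jsmith', 'Sub-Domain': 'dataflux', 'Top-Level Domain': 'com'}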
12. Specify properties for the Parsing node.
a. Type Parse Email in the Name field.
b. Click the down arrow under Field to parse and select EMAIL.
c. Click the down arrow under Definition and select E-mail.
d. Click the right arrow to choose the Mailbox, Sub-Domain, and Top-Level Domain tokens.
e. Click Preserve null values.
The Parsing Properties window should resemble the following:
f. Click OK to close the Parsing Properties window.
13. Establish an appropriate description and preview the Parsing node.
a. If necessary, click the Parsing node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Parsing as the value for the Description field.
e. Click the Parsing node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details pane.
g. Scroll to the right to view the Mailbox, Sub-Domain, and Top-Level Domain fields.
14. Add a Change Case node to the job flow.
a. In the Nodes resource pane, verify that the Quality grouping of nodes is expanded.
b. Double-click the Change Case node. The node appears in the data flow. The Change Case Properties window appears.
c. Type Upper Case EMAIL Fields in the Name field.
d. Double-click each of the following fields to move them from the Available list box to the Selected list box: EMAIL, Mailbox, Sub-Domain, Top-Level Domain.
e. For each of the four selected fields, click the down arrow under Type and select Upper.
f. For each of the four selected fields, click the down arrow under Definition and select Upper.
g. Accept the default output names.
h. Click the Preserve null values check box.
The Change Case Properties window should resemble the following:
i. Click OK to close the Change Case Properties window.
15. Establish an appropriate description and preview the Change Case node.
a. If necessary, click the Change Case node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Change Case as the value for the Description field.
e. Click the Change Case node in the data flow diagram.
f. Click the Preview tool.
g. Scroll to the right to view the new cased e-mail token fields.
16. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Quality grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Text File Output node. An instance of the node appears in the data flow. The Text File Output Properties window appears.
17. Specify properties for the Text File Output node.
a. Type Customer Info in the Name field.
b. Specify the output file information.
1) Click the ellipsis button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D3_CustomerInfo.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to " (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click the Include header row check box.
4) Click the Display file after job runs check box.
d. Accept all fields as selected fields for the output file.
e. Click OK to save the changes. Close the Text File Output Properties window.
18. Establish an appropriate description and preview the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Text File as the value for the Description field.
19. Select File ⇨ Save.
20. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions ⇨ Run Data Job.
c. Verify that the text file appears.
d. Select File ⇨ Exit to close the Notepad window.
e. Notice the processing information on each node.
21. View the log information.
a. Click the Log tab.
b. Review the information for each of the nodes.
22. Click the Data Flow tab.
23. Select File ⇨ Close.

Investigating Right Fielding and Identification Analysis
In this demonstration, you examine the difference between identification analysis and right fielding.
1. If necessary, select Start ⇨ All Programs ⇨ DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D4_RightFielding_IDAnalysis in the Name field.
c. Click OK. The new data job appears on a tab.
6. Add the Text File Input node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. If necessary, click the collapse icon to collapse the Data Outputs grouping of nodes.
c. Click the expand icon in front of the Data Inputs grouping of nodes.
d. Double-click the Text File Input node. An instance of the node appears in the data flow. The properties window for the node appears.
7. Specify properties of the Text File Input node.
a. Type Prospect List in the Name field.
b. Click the ellipsis button next to the Input file field.
1) Navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click Prospect_List.txt.
3) Click Open to close the Open window.
c. Select " (double quotation mark) for the Text qualifier field.
d. Verify that Comma is the value for the Field delimiter field.
e. Click Number of rows to skip.
f. Verify that 1 (one) is the default value for number of rows to skip.
g. Click Import under the Fields area.
1) If necessary, navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click Prospect_List.dfl.
3) Click Open.
The final settings for the Text File Input Properties window should resemble the following:
h. Click OK to save the changes and close the Text File Input Properties window. The Text File Input node appears in the job diagram with updated information.
8. Establish an appropriate description and preview the Text File Input node.
a. If necessary, click the Text File Input node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Text File Input as the value for the Description field.
e. Click the Text File Input node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details pane. The Contact field has a mixture of individual names as well as corporate names. Right fielding and identification analysis both use identification definitions to help correctly identify the information found in a field.
9. Add a Right Fielding node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Quality grouping of nodes.
c. Double-click the Right Fielding node. The node appears in the data flow. The Right Fielding Properties window appears.
10. Specify the properties of the Right Fielding node.
a. Type Right Field Contact Info in the Name field.
b. Click the down arrow for the Definition field and select Individual/Organization.
c. Under the Input fields area, double-click the Contact field to move it from the Available list box to the Selected list box.
d. Under the Output fields area, add three new fields.
1) Click Add.
2) Click the down arrow under Identity and click Unknown.
3) Double-click the Field0 value under Output Name.
4) Type Unknown and press ENTER.
5) Click Add.
6) Click the down arrow under Identity and click Organization.
7) Double-click the Field1 value under Output Name.
8) Type Company and press ENTER.
9) Click Add.
10) Click the down arrow under Identity and click Individual.
11) Double-click the Field2 value under Output Name.
12) Type Person and press ENTER.
e. Click Preserve null values.
The final settings for the Right Fielding node should resemble the following:
f. Click Additional Outputs.
1) Click the Contact field under the Output fields area.
2) Click the down arrow multiple times to move the Contact field to the bottom of the list. The Contact field is moved to the end of the list so that it can be more easily compared to the results of the Right Fielding node.
3) Click OK to close the Additional Outputs window.
g. Click OK to close the Right Fielding Properties window.
11. Establish an appropriate description and preview the Right Fielding node.
a. If necessary, click the Right Fielding node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Right Fielding as the value for the Description field.
e. Click the Right Fielding node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details pane.
g. Scroll to the right to view the new column information. The Right Fielding node correctly identified data values from the Contact field as either Company (Organization) values, Person (Individual) values, or Unknown values.
12. Add an Identification Analysis node to the job flow.
a. Verify that the Quality grouping of nodes is expanded.
b. Double-click the Identification Analysis node. The node appears in the data flow. The Identification Analysis Properties window appears.
13. Specify the properties for the Identification Analysis node.
a. Type Identify Contact Info in the Name field.
b. Click Contact in the Available list box in the Identification analysis fields area.
c. Click the right arrow to move Contact to the Selected list box.
d. Specify the correct identification definition for the selected Contact field.
1) Click the down arrow under Definition for the Contact field.
2) Select Individual/Organization.
3) Verify that Contact_Identity is the value for the Output Name field.
e. Click Preserve null values.
The final settings for the Identification Analysis node should resemble the following:
f. Click OK to close the Identification Analysis Properties window.
14. Establish an appropriate description and preview the Identification Analysis node.
a. If necessary, click the Identification Analysis node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Identification Analysis as the value for the Description field.
e. Click the Identification Analysis node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details pane.
g. Scroll to the right to view the new column information.
The Identification Analysis node also correctly identified data values from the Contact field as either Organization or Individual values. Right fielding moves the data to an appropriate field, but identification analysis creates a field with values to indicate whether each observation is an INDIVIDUAL or ORGANIZATION value.
15. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Quality grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Text File Output node. An instance of the node appears in the data flow. The Text File Output Properties window appears.
16. Specify properties for the Text File Output node.
a. Type Prospects RF and ID in the Name field.
b. Specify the output file information.
1) Click the ellipsis button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D4_Prospects_RF_ID.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to " (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click the Include header row check box.
4) Click the Display file after job runs check box.
d. Accept all fields as selected fields for the output file.
e. Click OK to save the changes and close the Text File Output Properties window.
17. Establish an appropriate description for the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Text File as the value for the Description field.
18. Save the job.
a. Click the Data Flow tab.
b. Select File ⇨ Save.
19. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions ⇨ Run Data Job.
c. Verify that the text file appears.
d. Select File ⇨ Exit to close the Notepad window.
e. Notice the processing information on each node.
20. View the log information.
a. Click the Log tab.
b. Review the information for each of the nodes.
21. Select File ⇨ Close.

Exercises
2. Creating a Data Job that Involves Standardization and Parsing
Create a data job that uses the Standardization, Parsing, and Data Target (Insert) nodes. The final job flow should resemble the following:
• Create a new data job named Ch4E2_Manufacturers_DataQuality in the batch_jobs folder of the Basics Exercises repository.
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data source. Select the following fields: ID, MANUFACTURER, CONTACT, CONTACT_ADDRESS, CONTACT_CITY, CONTACT_STATE_PROV, CONTACT_POSTAL_CD, CONTACT_CNTRY, CONTACT_PHONE, POSTDATE.
• Standardize the following fields. Accept the default names for the standardized fields.
Field Name → Definition or Scheme:
MANUFACTURER → Scheme: Ch3D8 Company Phrase Scheme
CONTACT → Definition: Name
CONTACT_ADDRESS → Definition: Address
CONTACT_STATE_PROV → Definition: State/Province (Abbreviation)
CONTACT_CNTRY → Scheme: Ch3E6 CONTACT_CNTRY Scheme
• Parse the standardized CONTACT field using the Name parse definition. Be sure to preserve null values. Select only three tokens and rename the output fields according to the following:
Token Name → Output Name:
Given Name → FIRST_NAME
Middle Name → MIDDLE_NAME
Family Name → LAST_NAME
• Output the standardized, parsed data to a new table named MANUFACTURERS_STND (in the dfConglomerate Grocery data connection).
If the data job runs multiple times, ensure that the records for each run are the only records in the table. In addition, only output the following fields with output names:
Field Name → Output Name:
ID → ID
MANUFACTURER_Stnd → MANUFACTURER
FIRST_NAME → FIRST_NAME
MIDDLE_NAME → MIDDLE_NAME
LAST_NAME → LAST_NAME
CONTACT_ADDRESS_Stnd → CONTACT_ADDRESS
CONTACT_CITY → CONTACT_CITY
CONTACT_STATE_PROV_Stnd → CONTACT_STATE_PROV
CONTACT_POSTAL_CD → CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd → CONTACT_CNTRY
CONTACT_PHONE → CONTACT_PHONE
POSTDATE → POSTDATE
• Save and run the data job.
• Review the log and close the data job.
• Verify that the records were written to the MANUFACTURERS_STND table in the dfConglomerate Grocery data connection.

Using a Standardization Definition and a Standardization Scheme
In this demonstration, you create a data job to standardize a text file and output the standardized information to another text file. This output text file is used as a source for a profile. The profile reveals that the standardization of a State field does not "correct" all values. A scheme is created to cover the values that are an issue. The initial data job is updated to use the scheme in addition to the original definition. The resultant text file is profiled and reveals the proper correction of the State field.
1. If necessary, select Start ⇨ All Programs ⇨ DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D5_StandardizationDefinitionAndScheme in the Name field.
c. Click OK. The new data job appears on a tab.
6. Add the Text File Input node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the Resource pane.
b. If necessary, click the collapse icon to collapse the Data Outputs grouping of nodes.
c. Click the expand icon in front of the Data Inputs grouping of nodes.
d. Double-click the Text File Input node. An instance of the node appears in the data flow and the properties window for the node appears.
7. Specify the properties of the Text File Input node.
a. Type Prospective Customers in the Name field.
b. Click the ellipsis button next to the Input file field.
1) Navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click Prospect_List.txt.
3) Click Open.
c. Select " (double quotation mark) as the value for the Text qualifier field.
d. Verify that Comma is the value for the Field delimiter field.
e. Click Number of rows to skip.
f. Verify that 1 (one) is the default value for the number of rows to skip.
g. Click Import under the Fields area.
1) If necessary, navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click Prospect_List.dfl.
3) Click Open.
h. Click OK to close the Text File Input Properties window.
8. Edit the Basic Settings for the Text File Input node, and then preview the data.
a. If necessary, click View ⇨ Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Text File Input as the value for the Description field.
d. Click the Preview tab.
e. Right-click the Text File Input node and select Preview.
9. Add a Standardization node to the data flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Quality grouping of nodes.
c. Double-click the Standardization node. The node appears in the data flow. The Standardization Properties window appears.
10. Specify the properties for the Standardization node.
a. Type Standardize State in the Name field.
b. Double-click the State field to move it from the Available list box to the Selected list box.
c. For the State field, click the down arrow under Definition and select State/Province(Abbreviation).
d. Click the Preserve null values check box.
The Standardization Properties window should resemble the following:
e. Click OK to close the Standardization Properties window.
11. Edit the Basic Settings for the Standardization node.
a. If necessary, click View ⇨ Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Standardization as the value for the Description field.
12. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Quality grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Text File Output node. An instance of the node appears in the data flow. The Text File Output Properties window appears.
13. Specify the properties for the Text File Output node.
a. Type Customer Info – Standard State in the Name field.
b. Specify the output file information.
1) Click the ellipsis button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D5_Customer_StdState.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to " (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Accept all fields as selected fields for the output file.
e. Export the field layout.
1) Click Export.
2) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D5_Customer_StdState.dfl in the File name field.
4) Click Save.
f. Click OK to save the changes and close the Text File Output Properties window.
14. Edit the Basic Settings for the Text File Output node.
a. If necessary, click View ⇨ Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Text File Output as the value for the Description field.
The data job should resemble the following:
15. Save the job.
a. Click the Data Flow tab.
b. Select File ⇨ Save.
16. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions ⇨ Run Data Job.
c. Verify that the text file appears.
d. Select File ⇨ Exit to close the Notepad window.
e. Notice the processing information on each node.

Profiling State and State_Stnd and Creating a Scheme
1. Click the Home tab.
2. If necessary, click the Folders riser bar.
3. Click the Basics Demos repository.
4. Click Profile.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D5_TextFile_Profile in the Name field.
c. Click OK. The new profile appears on a tab.
5. On the Properties tab, click Text Files.
6. Select Insert ⇨ New Text File.
a. Type Customer Info in the Table name field.
b. Click OK to close the Insert Text File window.
7. Specify delimited file information.
a. Click the ellipsis button next to the Input file name field. The Open window appears.
1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Click the Ch4D5_Customer_StdState.txt file.
3) Click Open. The path and filename are returned to the Input file name field.
b. Click Import below the Fields area.
1) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Click the Ch4D5_Customer_StdState.dfl file.
3) Click Open.
c. Click Number of rows to skip and verify that the default value is set to 1 (one).
d. Click OK to close the Delimited File Information window. The fields area is populated with information about the fields in the text file.
8. Click the State check box.
9. Click the State_Stnd check box.
10. Select all metrics.
a. Select Tools ⇨ Default Profile Metrics.
b. Verify that all metrics are selected.
c. Click OK to close the Metrics window.
11. Select File ⇨ Save Profile to save the profile.
12. Select Actions ⇨ Run Profile Report.
a. Type Profiling State fields in the Description field.
b. Click OK to close the Run Profile window. The profile executes. The status of the execution is displayed. The Report tab becomes active.
13. Review the Profile report.
a. Click the expand icon in front of Customer Info to view the available fields.
b. Click the State field.
c. If necessary, click the Column Profiling tab. Notice the Unique Count metric.
d. Click the Frequency Distribution tab.
e. Click the State_Stnd field.
f. Click the Column Profiling tab. Notice the Pattern Count and Unique Count metrics.
g. Click the Frequency Distribution tab.
14. Create a scheme to "cover" the value PENSILVANEYA.
a. Click the Report tab.
b. Click the Customer Info text file.
c. If necessary, click the Standard Metrics tab.
d. On the Standard Metrics tab, click the State_Stnd field.
e. Select Actions ⇨ Build a Scheme. The Report Generation window appears.
f. Click OK to close the Report Generation window. The Scheme Builder window appears.
g. At the bottom of the Report side, type PA in the Standard field.
h. Click the value PENSILVANEYA.
i. Click Add To Scheme.
j. Select File ⇨ Save.
1) Type Ch4D5 State_PA_Scheme in the Name field.
2) Click Save.
k. Select File ⇨ Exit to close the Scheme Builder window.

Editing and Re-executing the Data Job and the Profile
1. Click the Ch4D5_StandardizationDefinitionAndScheme tab.
2. Edit the Standardize State node.
a. Right-click the Standardize State node and select Properties.
b. For the State field, click the down arrow under Scheme and select Ch4D5 State_PA_Scheme.
c. Click OK to save the updated standardization properties.
3. Save the job.
a. Click the Data Flow tab.
b. Select File ⇨ Save.
4. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions ⇨ Run Data Job.
c. Verify that the text file appears.
d. Select File ⇨ Exit to close the Notepad window.
5. Re-run the profile.
a. Click the Ch4D5_TextFile_Profile tab.
b. Select Actions ⇨ Run Profile Report.
1) Type Profiling Updated State field in the Description field.
2) Click OK to close the Run Profile window. The profile executes. The status of the execution is displayed. The Report tab becomes active.
6. Review the Profile report.
a. Click the expand icon in front of Customer Info to view the available fields.
b. Click the State_Stnd field.
c. If necessary, click the Column Profiling tab. Notice the Pattern Count and Unique Count metrics.
d. Click the Frequency Distribution tab.
7. Select File ⇨ Close Profile.
8. Select File ⇨ Close (to close the data job).
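The order of operations that this demonstration relies on (the definition is applied first, then the scheme is applied to the definition's output) can be sketched in Python as follows (illustrative only; the stand-in definition's lookup values are invented):

    state_scheme = {"PENSILVANEYA": "PA"}  # built in the Scheme Builder above

    def stnd_definition(value: str) -> str:
        """Stand-in for the State/Province(Abbreviation) definition:
        known names become abbreviations, unknown values pass through."""
        known = {"NORTH CAROLINA": "NC", "PENNSYLVANIA": "PA"}
        v = value.strip().upper()
        return known.get(v, v)

    def standardize(value: str) -> str:
        out = stnd_definition(value)       # definition first...
        return state_scheme.get(out, out)  # ...then the scheme on its output

    print(standardize("north carolina"))  # NC (handled by the definition)
    print(standardize("Pensilvaneya"))    # PA (caught by the added scheme)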
Address Verification

Original Address: 940 Cary Parkway 27513
Verified Address Information:
Street Address - 940 NW CARY PKWY
City - CARY
State - NC
Zip - 27513
Zip+4 - 2792
County Name - WAKE
Congressional District - 4

Address verification identifies, corrects, and enhances address information. Data files can be licensed to verify address information for U.S., Canadian, and international addresses.

Geocoding

Verified Address: 940 NW CARY PKWY, CARY, NC 27513-2792
Verified Geocode Information: Latitude 35.811753, Longitude -78.802326

The latitude and longitude returned by geocoding can be used to map locations and plan efficient delivery routes. Geocoding can be licensed to return this information for the centroid of the postal code or at the roof-top level. Currently, geocoding data files exist only for the United States and Canada, and roof-top level geocoding is available only for the United States.

Performing Address Verification and Geocoding

In this demonstration, you use an Address Verification node to verify the addresses present in the Customers table from dfConglomerate Gifts. The resulting verified ZIP field is used in a Geocoding node to produce additional geocode columns. The final result set is written to a text file.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D6_AddressVerification_Example in the Name field.
c. Click OK. The new data job appears on a tab.
6. Add the Data Source node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the Resource pane.
b. Click the expand icon in front of the Data Inputs grouping of nodes.
c. Double-click the Data Source node. An instance of the node appears in the data flow. The properties window for the node appears.
d. Type Customers Table in the Name field.
e. Click the browse button next to the Input table field.
1) Click the expand icon in front of the dfConglomerate Gifts data source.
2) Click Customers.
3) Click OK to close the Select Table window.
f. Click OK to save the changes and close the Data Source Properties window.
7. Edit the Basic Settings for the Data Source node.
a. If necessary, click View → Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Data Source as the value for the Description field.
8. Add an Address Verification (US/Canada) node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Enrichment grouping of nodes.
c. Double-click the Address Verification (US/Canada) node. The node appears in the data flow. The Address Verification (US/Canada) Properties window appears.
9. Specify properties for the Address Verification (US/Canada) node.
a. Type Address Verification in the Name field.
b. Verify that United States is selected in the Address area.
c. Specify the Input information.
1) Click under Field Type for the ADDRESS field and click Address Line 1.
2) Click under Field Type for the CITY field and click City.
3) Click under Field Type for the STATE/PROVINCE field and click State.
4) Click under Field Type for the ZIP/POSTAL CODE field and click Zip.
d. Specify options for the address verification.
1) Click Options.
2) Click Proper case results.
3) Click Street abbreviation.
4) Click City abbreviation.
5) Clear CASS compliance.
6) Verify that Insert dash between ZIP and ZIP4 is selected.
7) Click OK to save the settings and close the Options window.
e. Specify the output fields information.
1) Double-click Address Line 1 to move this field from the Available list box to the Selected list box.
2) Double-click City to move this field from the Available list box to the Selected list box.
3) Double-click State to move this field from the Available list box to the Selected list box.
4) Double-click ZIP/Postal_Code to move this field from the Available list box to the Selected list box.
5) Double-click US_ZIP to move this field from the Available list box to the Selected list box.
6) Double-click US_ZIP4 to move this field from the Available list box to the Selected list box.
7) Double-click US_Result_Code to move this field from the Available list box to the Selected list box.
8) Double-click US_Numeric_Result_Code to move this field from the Available list box to the Selected list box.
9) For the Output Name field, type _Verified after Address_Line_1.
10) For the Output Name field, type _Verified after City.
11) For the Output Name field, type _Verified after State.
12) For the Output Name field, type _Verified after ZIP/Postal_Code.
The final set of selected output fields should resemble the following:
f. Limit the original fields in the output.
1) Click Additional Outputs.
2) Click the EMAIL field in the Output fields area, and then click the delete button.
3) Click the JOB TITLE field in the Output fields area, and then click the delete button.
4) Click the BUSINESS PHONE field in the Output fields area, and then click the delete button.
5) Click the HOME PHONE field in the Output fields area, and then click the delete button.
6) Click the MOBILE PHONE field in the Output fields area, and then click the delete button.
7) Click the FAX NUMBER field in the Output fields area, and then click the delete button.
8) Click the COUNTRY/REGION field in the Output fields area, and then click the delete button.
9) Click the NOTES field in the Output fields area, and then click the delete button.
The final set of additional output fields should resemble the following:
10) Click OK to close the Additional Outputs window.
g. Click OK to close the Address Verification (US/Canada) Properties window.
10. Establish an appropriate description and preview the Address Verification (US/Canada) node.
a. If necessary, click the Address Verification node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Verify Addresses as the value for the Description field.
e. Click the Address Verification node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details panel.
g. Scroll to view the results. Notice the following:
• The street types in the original Address field are abbreviated in the verified Address field.
• Some verified city values are updated to a new value that corresponds to the ZIP code information.
• Some city "words" in the original City field are abbreviated in the verified city field.
• Some State field values that do not have the pattern AA are verified as state values with the pattern AA.
• Most of the preview sample has a U.S. result code of OK and a U.S. numeric result code of 0.
The US_Result_Code field indicates whether the address was successfully verified.
In the case where the address was not successfully verified, the U.S. result code indicates the cause of the address verification failure. The US_Numeric_Result_Code field is the numeric counterpart of the US_Result_Code field. Possible values for both fields are as follows:

OK (0) - Address was verified successfully.
PARSE (11) - Error parsing the address. Components of the address might be missing.
CITY (12) - Could not locate city, state, or ZIP code in the USPS database. At least city and state or ZIP code must be present in the input.
MULTI (13) - Ambiguous address. There are two or more possible matches for this address with different data.
NOMATCH (14) - No matching address is found in the USPS data.
OVER (15) - One or more input strings is too long (maximum 100 characters).

11. Add a Geocoding node to the job flow.
a. If necessary, click the expand icon in front of the Enrichment grouping of nodes.
b. Double-click the Geocoding node. The node appears in the data flow. The Geocoding Properties window appears.
12. Specify the properties for the Geocoding node.
a. Type Geocoding using Zip Verified field in the Name field.
b. In the Input area, click under Field Type for the ZIP/Postal_Code_Verified field and click ZIP+4 (US Only).
c. In the Output fields area, click the double right arrow to move all available fields to the Selected list box.
d. Click OK to save the settings and close the Geocoding Properties window.
13. Establish an appropriate description and preview the Geocoding node.
a. If necessary, click the Geocoding node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Geocoding as the value for the Description field.
e. Click the Geocoding node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details panel.
g. Scroll to view the results. Notice the following:
• The Geocode_Result_Code field has mainly OK as a returned value.
• Records where the verified ZIP code was only five digits have a Geocode_Result_Code field value of OK5DIGIT. These records have only Geocode_Latitude and Geocode_Longitude as additional returned information. The remaining output fields are blank.
The field definitions are shown below:

Geocode_Result_Code - Indicates whether the record was successfully geocoded. Other possible codes are as follows:
• DP - The match is based on the delivery point.
• PLUS4 - The match failed on the delivery point, so the match is based on ZIP+4.
• ZIP - The ZIP+4 match failed, so the match is based on the ZIP code.
• NOMATCH - The first three checks failed, so there is no match in the geocoding database.
Geocode_Latitude - The numerical horizontal map reference for address data.
Geocode_Longitude - The vertical map reference for address data.
Geocode_Census_Tract - A U.S. Census Bureau reference number assigned using the centroid latitude and longitude. This number contains references to the State and County codes.
Geocode_FIPS - The Federal Information Processing Standards (FIPS) number that the U.S. Census Bureau uses to refer to certain data.
Geocode_Census_Block - The last four digits of the State/County/Tract/Block value.
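These result codes lend themselves to simple downstream triage. The following is a minimal sketch (in Python, outside of Data Management Studio) of reading the text file that this demonstration exports and splitting the records by the codes described above. The field names come from the demo's output; the counts that are printed depend on your data.

```python
import csv

# Path of the text file that this demonstration exports.
INPUT = r"S:\Workshop\dqdmp1\Demos\files\output_files\Ch4D6_Customer_Verify_Geocode.txt"

# Text result codes that indicate a failed verification (from the table above).
FAILURES = {"PARSE", "CITY", "MULTI", "NOMATCH", "OVER"}

with open(INPUT, newline="") as f:
    rows = list(csv.DictReader(f))  # the export includes a header row

verified = [r for r in rows if r["US_Result_Code"] == "OK"]
failed = [r for r in rows if r["US_Result_Code"] in FAILURES]

# OK5DIGIT geocodes return only latitude/longitude; the census fields are blank.
partial_geo = [r for r in rows if r.get("Geocode_Result_Code") == "OK5DIGIT"]

print(len(verified), "verified;", len(failed), "failed;",
      len(partial_geo), "geocoded at the 5-digit ZIP level")
```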
14. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Enrichment grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Text File Output node. An instance of the node appears in the data flow. The Text File Output Properties window appears.
15. Specify properties for the Text File Output node.
a. Type Customer Info – Verify/Geocode in the Name field.
b. Specify the output file information.
1) Click the browse button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D6_Customer_Verify_Geocode.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to " (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Accept all fields as selected fields for the output file.
e. Export the field layout.
1) Click Export.
2) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D6_Customer_Verify_Geocode.dfl in the File name field.
4) Click Save.
f. Click OK to save the changes and close the Text File Output Properties window.
16. Establish an appropriate description for the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output Text File as the value for the Description field.
The data job should resemble the following:
17. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
18. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job.
c. Verify that the text file appears.
d. Select File → Exit to close the Notepad window.
e. Notice the processing information on each node.
19. Select File → Close.

Exercises

3. Performing Address Verification on the MANUFACTURERS Table
Create a data job that works with the Address Verification node. The final job flow should resemble the following:
• Create a data job (Ch4E3_MANUFACTURERS_Verify) in the batch_jobs folder of the Basics Exercises repository.
• Use the MANUFACTURERS table from dfConglomerate Grocery as the data source.
• Add an Address Verification (US/Canada) node to the job flow. Use the following specifications:
− Map the following input fields (Field Name - Field Type):
MANUFACTURER - Firm
STREET_ADDR - Address Line 1
CITY - City
STATE_PROV - State
POSTAL_CD - Zip
− Specify the following options:
Proper case results
Output blanks as nulls
Street abbreviation
City abbreviation
Insert dash between ZIP and ZIP4
− Specify the following output fields (Output Type - Output Name):
Firm - Firm_Verified
Address Line 1 - Address_Verified
City - City_Verified
State - State_Verified
ZIP/Postal Code - ZIP_Verified
US County Name - US_County_Name
US Result Code - US_Result_Code
• Retain only the following list of original fields for additional outputs:
ID, MANUFACTURER, STREET_ADDR, CITY, STATE/PROV, POSTAL_CD, COUNTRY, PHONE
• Add a Text File Output node to the job flow. Use the following specifications:
Output file: S:\Workshop\dqdmp1\Exercises\files\OutputFiles\Ch4E3_Manufacturers_Verify.txt
Text qualifier: " (double quotation mark)
Field delimiter: Comma
Include header row
Display the file after job runs
Specify the following fields for the text file with the specified output names (Field Name - Output Name):
ID - ID
Firm_Verified - Manufacturer
Address_Verified - Address
City_Verified - City
State_Verified - State
ZIP_Verified - ZIP
US_County_Name - US_County_Name
US_Result_Code - US_Result_Code
• Save and run the job. Verify that the text file is created properly.

4.4 Entity Resolution Jobs

Objectives
Define match codes.
Describe cluster conditions.

Match Codes

Name - Match Code @ 85 Sensitivity
John Q Smith - [email protected]$$$$$$
Johnny Smith - [email protected]$$$$$$
Jonathon Smythe - [email protected]$$$$$$

Match code generation process:
1. Data is parsed into its components (for example, Given Name and Family Name).
2. Ambiguities and noise words are removed (for example, the).
3. Transformations are made (for example, Jonathon > Jon).
4. Phonetics are applied (for example, PH > F).
5. Based on the sensitivity selection, the relevant components are determined, and only a certain number of characters of the transformed relevant components are used.
(A simplified sketch of this pipeline appears after the clustering discussion below.)

DataFlux combines the strengths of probabilistic and deterministic matching: it provides dichotomous (true/false) results while taking advantage of the scoring flexibility of the probabilistic matching technique. With a single pass through the data, the DataFlux matching engine produces an unambiguous match code that represents the identifying variables that were specified by the user. The key to the DataFlux technology is this match code.
• Prior to match code generation, the matching engine parses the data into its components and implicitly removes all ambiguities in the system. Using industry-leading data quality algorithms, it removes nonessential matching information and standardizes the data.
• The matching engine enables near matching by enabling you to specify a match threshold (sensitivity) for each identifying variable. Match codes are generated in direct correlation to the desired sensitivity. If the sensitivity is higher, the matching requirements are more stringent and the match code is more specific. This enables you to specify high matching requirements for some variables and lower requirements for others.
• The matching engine enables the specification of configurable business rules that impact the match. The rules are implicitly applied before the match code is generated. This enables you to make specific corrections in the data before performing any linkage. Changing the matching logic requires the Data Management Studio – Customize module.
• Regardless of the data source, the matching engine creates the same match code as long as the same match definition and sensitivity setting are selected when the match code is generated. This enables matching across multiple data sources without constantly reprocessing and re-indexing the data, which is something that probabilistic matching systems cannot do.

Clustering (Grouping Records)

Grouping on the fields:
• Name_MC_85 and Address_MC_85 (matches records 1 and 2)
• Name_MC_85 and Phone (matches records 2 and 3)
Grouping on both conditions:
• Name_MC_85 and Address_MC_85, or Name_MC_85 and Phone (causes all three records to match)

The Clustering node provides the ability to match records based on multiple conditions. Create the conditions that support your business needs.
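Before the demonstration, here is the toy sketch of the match code pipeline promised above. The noise words, transformations, phonetic rules, and code layout are illustrative stand-ins, not the Quality Knowledge Base's actual rules, so the codes differ from real DataFlux match codes. The point is only that all three names from the slide collapse to the same code at 85 sensitivity.

```python
# Toy stand-ins for QKB vocabulary, transformation, and phonetic rules.
NOISE = {"the", "and", "of"}
TRANSFORMS = {"JONATHON": "JON", "JOHNNY": "JOHN", "SMYTHE": "SMITH"}
PHONETICS = [("PH", "F"), ("GH", "G")]

def match_code(full_name: str, sensitivity: int = 85) -> str:
    # 1. Parse into components (here: naive given/family name split),
    # 2. dropping noise words along the way.
    parts = [p for p in full_name.upper().split() if p.lower() not in NOISE]
    given, family = parts[0], parts[-1]
    # 3. Apply transformations (Jonathon > Jon, and so on).
    given = TRANSFORMS.get(given, given)
    family = TRANSFORMS.get(family, family)
    # 4. Apply phonetic reductions (PH > F, and so on).
    for old, new in PHONETICS:
        given, family = given.replace(old, new), family.replace(old, new)
    # 5. Sensitivity controls how many characters of each component survive;
    #    unused positions are padded with $ as in the slide example.
    keep = 3 if sensitivity >= 85 else 2
    return (given[:1] + "~" + family[:keep]).ljust(12, "$")

# All three slide names collapse to one code at sensitivity 85.
for name in ("John Q Smith", "Johnny Smith", "Jonathon Smythe"):
    print(name, "->", match_code(name))
```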
Creating a Data Job to Cluster Records

In this demonstration, you generate clusters of data based on three conditions. The conditions take advantage of match codes generated on various fields, as well as standardized field information. A match report displays the results of the clustering.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D7_ClusterRecords_Example in the Name field.
c. Click OK. The new data job appears on a tab.
6. Add the Data Source node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. Click the expand icon in front of the Data Inputs grouping of nodes.
c. Double-click the Data Source node. An instance of the node appears in the data flow. The properties window for the node appears.
d. Type Customers Table in the Name field.
e. Click the browse button next to the Input table field.
1) Click the expand icon in front of the dfConglomerate Gifts data source.
2) Click Customers.
3) Click OK to close the Select Table window.
f. Click OK to save the changes and close the Data Source Properties window.
7. Establish an appropriate description for the Data Source node.
a. If necessary, click the Data Source node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Data Source as the value for the Description field.
8. Add a Match Codes (Parsed) node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Entity Resolution grouping of nodes.
c. Double-click the Match Codes (Parsed) node. The node appears in the data flow. The Match Codes (Parsed) Properties window appears.
The (Parsed) nodes are used when the input data is parsed. Because the Name match definition is designed for a full name, if you generate a match code on FIRST_NAME only, in most cases the definition would assume that this was a last name. Thus, the matching of nicknames would not happen. For example, Jon would not match John.
9. Specify properties for the Match Codes (Parsed) node.
a. Type Match Codes for Parsed Name in the Name field.
b. Type Name_MatchCode in the Output field field.
c. Click under Sensitivity and select 85.
d. Click under Definition and select Name.
e. Click under Field Name for the Given Name token and click FIRST NAME.
f. Click under Field Name for the Family Name token and click LAST NAME.
g. Click Generate null match codes for blank field values.
h. Click Preserve null values.
The final settings for the Match Codes (Parsed) node should resemble the following:
i. Click OK to close the Match Codes (Parsed) Properties window.
The Allow generation of multiple match codes per definition option requires the creation of a special match definition in the Data Management Studio – Customize module.
The Generate null match codes for blank field values option generates a NULL match code if the field is blank. If this option is not selected, then a match code of all $ symbols is generated for the field. When you match records, a field with NULL does not equal another field with NULL, but a field with all $ symbols equals another field with all $ symbols.
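A two-line sketch of the equality semantics just described, with NULL modeled as Python's None:

```python
# NULL match codes never join records; all-$ codes join with each other.
def codes_match(a, b):
    if a is None or b is None:  # NULL never equals NULL
        return False
    return a == b               # "$$$$$$" == "$$$$$$" is True

print(codes_match(None, None))          # False - blank fields are kept apart
print(codes_match("$$$$$$", "$$$$$$"))  # True  - blank fields cluster together
```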
10. Establish an appropriate description and preview the Match Codes (Parsed) node.
a. If necessary, click the Match Codes (Parsed) node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Parsed Input as the value for the Description field.
e. Click the Match Codes (Parsed) node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the Name_MatchCode field.
11. Add a Match Codes node to the job flow.
a. Verify that the Entity Resolution grouping of nodes is expanded.
b. Double-click the Match Codes node. The node appears in the data flow. The Match Codes Properties window appears.
12. Specify properties for the Match Codes node.
a. Type Match Codes for Various Fields in the Name field.
b. Double-click the COMPANY field to move it from the Available list box to the Selected list box.
c. Click under Definition and select Organization.
d. Accept the default Sensitivity of 85.
e. Double-click the ADDRESS field to move it from the Available list box to the Selected list box.
f. Click under Definition and select Address.
g. Accept the default Sensitivity of 85.
h. Double-click the CITY field to move it from the Available list box to the Selected list box.
i. Click under Definition and select City.
j. Accept the default Sensitivity of 85.
k. Double-click the STATE/PROVINCE field to move it from the Available list box to the Selected list box.
l. Click under Definition and select State/Province.
m. Accept the default Sensitivity of 85.
n. Double-click the ZIP/POSTAL CODE field to move it from the Available list box to the Selected list box.
o. Click under Definition and select Postal Code.
p. Accept the default Sensitivity of 85.
q. Click Generate null match codes for blank field values.
r. Click Preserve null values.
The final settings for the Match Codes Properties window should resemble the following:
s. Click OK to close the Match Codes Properties window.
13. Establish an appropriate description and preview the Match Codes node.
a. If necessary, click the Match Codes node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Match Codes as the value for the Description field.
e. Click the Match Codes node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the _MatchCode fields.
Generating a match code performs an out-of-the-box standardization behind the scenes. Unless you want to write out the standardized values, or you want to perform custom standardizations by using a scheme or a modified definition, you do not need to standardize before generating match codes.
14. Add a Standardization node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
b. Click the expand icon in front of the Quality grouping of nodes.
c. Double-click the Standardization node. The node appears in the data flow. The Standardization Properties window appears.
15. Specify properties for the Standardization node.
a. Type Standardize Phone in the Name field.
b. Double-click BUSINESS PHONE to move it from the Available list box to the Selected list box.
c. For the BUSINESS PHONE field, click under Definition and select Phone.
d. Click Preserve null values. The Standardization Properties window should resemble the following:
e. Click OK to close the Standardization Properties window.
16. Establish an appropriate description for the Standardization node.
a. If necessary, click the Standardization node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Standardization as the value for the Description field.
17. Add a Clustering node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Quality grouping of nodes.
b. Click the expand icon in front of the Entity Resolution grouping of nodes.
c. Double-click the Clustering node. The node appears in the data flow. The Clustering Properties window appears.
18. Specify properties for the Clustering node.
a. Type Cluster on Three Conditions in the Name field.
b. Type Cluster_ID in the Output cluster ID field field.
c. Click Options.
1) Click Condition matched field prefix.
2) Type COND_ in the Condition matched field prefix field.
3) Click OK.
d. Click Sort output by cluster number.
e. Verify that All clusters is selected in the Cluster area. The clustering properties at this point should resemble the following:
f. Specify the first condition.
1) Double-click Name_MatchCode to move it from Available fields to Selected fields.
2) Double-click COMPANY_MatchCode to move it from Available fields to Selected fields.
3) Double-click ADDRESS_MatchCode to move it from Available fields to Selected fields.
4) Double-click CITY_MatchCode to move it from Available fields to Selected fields.
5) Double-click STATE/PROVINCE_MatchCode to move it from Available fields to Selected fields.
g. Specify the second condition.
1) Click OR.
2) Double-click Name_MatchCode to move it from Available fields to Selected fields.
3) Double-click COMPANY_MatchCode to move it from Available fields to Selected fields.
4) Double-click ADDRESS_MatchCode to move it from Available fields to Selected fields.
5) Double-click ZIP/POSTAL CODE_MatchCode to move it from Available fields to Selected fields.
h. Specify the third condition.
1) Click OR.
2) Double-click Name_MatchCode to move it from Available fields to Selected fields.
3) Double-click COMPANY_MatchCode to move it from Available fields to Selected fields.
4) Double-click BUSINESS PHONE_Stnd to move it from Available fields to Selected fields.
The final set of conditions should resemble the following:
i. Click OK to close the Clustering Properties window.
19. Establish an appropriate description and preview the Clustering node.
a. If necessary, click the Clustering node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Clustering as the value for the Description field.
e. Click the Clustering node in the data flow diagram.
f. Click the Preview tool. A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the Cluster_ID and COND_ fields.
h. Scroll down through the observations. Notice that some records have duplicate values for Cluster_ID. The COND_ fields display true if that condition (1, 2, or 3) is met to form the cluster. Otherwise, the COND_ fields display false.
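As a cross-check of what the preview shows, the following sketch mimics how OR-ed conditions chain records into one cluster. The three toy records mirror the slide example (records 1 and 2 share name, company, and address; records 2 and 3 share name, company, and phone). The field names parallel the demo, but the data and the union-find implementation are illustrative, not the Clustering node's actual algorithm.

```python
CONDITIONS = [
    ("Name_MC", "COMPANY_MC", "ADDRESS_MC", "CITY_MC", "STATE_MC"),  # condition 1
    ("Name_MC", "COMPANY_MC", "ADDRESS_MC", "ZIP_MC"),               # condition 2
    ("Name_MC", "COMPANY_MC", "PHONE_Stnd"),                         # condition 3
]

records = [
    {"ID": 1, "Name_MC": "J~SMI", "COMPANY_MC": "ACME", "ADDRESS_MC": "A1",
     "CITY_MC": "C1", "STATE_MC": "NC", "ZIP_MC": "27513", "PHONE_Stnd": "919-555-0100"},
    {"ID": 2, "Name_MC": "J~SMI", "COMPANY_MC": "ACME", "ADDRESS_MC": "A1",
     "CITY_MC": "C1", "STATE_MC": "NC", "ZIP_MC": "27513", "PHONE_Stnd": "919-555-0199"},
    {"ID": 3, "Name_MC": "J~SMI", "COMPANY_MC": "ACME", "ADDRESS_MC": "A9",
     "CITY_MC": "C9", "STATE_MC": "NC", "ZIP_MC": "27511", "PHONE_Stnd": "919-555-0199"},
]

parent = {r["ID"]: r["ID"] for r in records}
def find(x):                      # union-find: follow parents to the cluster root
    while parent[x] != x:
        x = parent[x]
    return x

for i, cond in enumerate(CONDITIONS, start=1):
    seen = {}
    for r in records:
        key = tuple(r[f] for f in cond)
        if None in key:
            continue                          # preserved nulls never match
        if key in seen:
            parent[find(r["ID"])] = find(seen[key])
            r[f"COND_{i}"] = True             # flag the record the condition matched
        else:
            seen[key] = r["ID"]

for r in records:
    print(r["ID"], "-> cluster", find(r["ID"]))
# Records 1 and 2 join on condition 1; records 2 and 3 join on condition 3,
# so all three records share one Cluster_ID, exactly as in the slide example.
```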
20. Add a Match Report node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Match Report node. The node appears in the data flow. The Match Report Properties window appears.
21. Specify properties for the Match Report node.
a. Type Customers Match Report in the Name field.
b. Click the browse button next to the Report file field.
1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Type Ch4D7_Customers_MatchReport in the File name field.
3) Click Save.
c. Type Customers Match Report – Three Conditions in the Report Title field.
d. Click Launch Viewer after job is completed.
e. Click the down arrow next to the Cluster field and select Cluster_ID. At this point, the Match Report Properties window should resemble the following:
f. Specify the report fields.
1) Click the double left arrow to remove all fields from the Selected list box.
2) Double-click Cluster_ID to move this field from the Available list box to the Selected list box.
3) Double-click ID to move this field from the Available list box to the Selected list box.
4) Double-click COMPANY to move this field from the Available list box to the Selected list box.
5) Double-click LAST NAME to move this field from the Available list box to the Selected list box.
6) Double-click FIRST NAME to move this field from the Available list box to the Selected list box.
7) Double-click ADDRESS to move this field from the Available list box to the Selected list box.
8) Double-click CITY to move this field from the Available list box to the Selected list box.
9) Double-click STATE/PROVINCE to move this field from the Available list box to the Selected list box.
10) Double-click ZIP/POSTAL CODE to move this field from the Available list box to the Selected list box.
11) Double-click BUSINESS PHONE_Stnd to move this field from the Available list box to the Selected list box.
The final set of selected fields should resemble the following:
g. Click OK to close the Match Report Properties window.
22. Establish an appropriate description for the Match Report node.
a. If necessary, click the Match Report node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Match Report as the value for the Description field.
The final job flow should resemble the following: (The diagram was redrawn to facilitate viewing the display.)
23. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
24. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job.
c. Verify that the dfReport Viewer appears. Because you requested all clusters, you get many "clusters" with one record. Use the paging tools to scroll through the pages of clusters.
QUESTION: What is the largest number of records in a cluster?
Answer: Three (3)
d. Select File → Exit to close the dfReport Viewer window. Notice the processing information on each node.
25. View the log information.
a. Click the Log tab.
b. Review the information for each of the nodes.
26. Click the Data Flow tab.
27. Select File → Close.

Exercises

4. Creating a Data Job that Generates a Match Report
Create a data job that clusters records and generates a match report. The final job flow should resemble the following:
• Create a new data job named Ch4E4_Manufacturers_MatchReport in the batch_jobs folder of the Basics Exercises repository.
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data source.
Select the following fields:
ID, MANUFACTURER, CONTACT, CONTACT_ADDRESS, CONTACT_CITY, CONTACT_STATE_PROV, CONTACT_POSTAL_CD, CONTACT_CNTRY, CONTACT_PHONE, POSTDATE
• Use the specified definitions and sensitivities. Accept the default names for the match code fields. Be sure to generate null match codes for blank field values and to preserve null values. Generate match codes for the following fields (Field Name - Definition - Sensitivity):
MANUFACTURER - Organization - 75
CONTACT - Name - 75
CONTACT_ADDRESS - Address - 85
CONTACT_STATE_PROV - State/Province - 85
CONTACT_POSTAL_CD - Postal Code - 85
CONTACT_PHONE - Phone - 95
• Cluster similar records using the following two conditions. Create a cluster field named Cluster_ID. The output must contain all clusters and be sorted by cluster number.
MANUFACTURER_MatchCode, CONTACT_MatchCode, CONTACT_ADDRESS_MatchCode, CONTACT_POSTAL_CD_MatchCode
OR
MANUFACTURER_MatchCode, CONTACT_MatchCode, CONTACT_STATE_PROV_MatchCode, CONTACT_PHONE_MatchCode
• Create a match report to display the cluster groupings. The match report should be opened automatically after the job is run. Name the match report Ch4E4_Manufacturers_MatchReport.mre and store it in the directory S:\Workshop\dqdmp1\Exercises\files\output_files. Display the following fields in the report:
ID, MANUFACTURER, CONTACT, CONTACT_ADDRESS, CONTACT_CITY, CONTACT_STATE_PROV, CONTACT_POSTAL_CD, CONTACT_CNTRY, CONTACT_PHONE
• Save and run the data job. Review the resulting match report and log.

Surviving Record Identification – Record Rules Defined

Record-level rules:
1. FIRST NAME = Highest Occurrence
2. ID = Maximum Value

Record-level rules select which record from a grouping should survive. If there is ambiguity about which record is the survivor, the tool selects the first remaining record in the grouping.

Surviving Record Identification – Field Rules Defined

Field-level rules:
1. Where EMAIL is Not Null, Select EMAIL
2. Where STATE is Shortest Value, Select CITY and STATE

Field rules are used to "steal" information from other records in the grouping. (A small sketch of these two rule types follows.)
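The sketch below shows how these two rule types interact on one made-up cluster: a record rule picks the survivor, and field rules then pull better values from the other records. The cluster data and e-mail address are invented, and this is not the node's actual implementation.

```python
from collections import Counter

cluster = [
    {"ID": 7,  "FIRST NAME": "Bob", "EMAIL": None,           "STATE": "NC"},
    {"ID": 12, "FIRST NAME": "Bob", "EMAIL": "bob@acme.com", "STATE": "N. Carolina"},
    {"ID": 4,  "FIRST NAME": "Rob", "EMAIL": None,           "STATE": "NC"},
]

# Record rule 1: FIRST NAME = Highest Occurrence; rule 2: ID = Maximum Value.
top_name = Counter(r["FIRST NAME"] for r in cluster).most_common(1)[0][0]
candidates = [r for r in cluster if r["FIRST NAME"] == top_name]
survivor = dict(max(candidates, key=lambda r: r["ID"]))

# Field rule 1: where EMAIL is Not Null, select EMAIL.
donors = [r for r in cluster if r["EMAIL"] is not None]
if donors:
    survivor["EMAIL"] = donors[0]["EMAIL"]

# Field rule 2: where STATE is Shortest Value, select STATE (on the slide,
# CITY is also pulled as an affected field of the same rule).
survivor["STATE"] = min(cluster, key=lambda r: len(r["STATE"]))["STATE"]

print(survivor)  # ID 12 survives, keeping its e-mail and gaining the short "NC"
```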
Entity Resolution File

Entity resolution is the process of merging duplicate records in a single file or multiple files so that records referring to the same physical object are treated as a single record. Records are matched based on the information that they have in common. The records that are merged might appear to be different but can actually refer to the same person or item. For example, a record for John Q. Smith at 220 Academy Street might be the same person as J. Q. Smith at the same address. The Entity Resolution File enables you to manually review the merged records and make adjustments as necessary. This can involve the following tasks:
• Examining clusters
• Reviewing the Cluster Analysis section
• Reviewing related clusters
• Processing cluster records

Creating an Entity Resolution Job

In this demonstration, you investigate various settings for the Surviving Record Identification node. In addition, you explore the Entity Resolution File Output node.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Copy an existing job to use as a starting point.
a. Click the expand icon in front of the Basics Demos repository.
b. Click the expand icon in front of the batch_jobs folder.
c. Right-click the Ch4D7_ClusterRecords_Example job and select Copy.
d. Right-click the batch_jobs folder and select Paste. A message window appears.
e. Click Rename the new item (Copy of …, Copy (2) of …, etc.).
f. Click OK.
5. Rename the copied job.
a. Right-click Copy of Ch4D7_ClusterRecords_Example and select Rename.
b. Type Ch4D8_SRI_Example.
c. Press ENTER.
6. Double-click Ch4D8_SRI_Example. The data job opens on a new tab.
7. Right-click the match report node (labeled Customers Match Report) and select Delete.
8. Add a Surviving Record Identification node to the job flow.
a. In the Nodes resource pane, click the expand icon in front of the Entity Resolution grouping of nodes.
b. Double-click the Surviving Record Identification node. The node appears in the data flow. The Surviving Record Identification Properties window appears.
9. Specify properties for the Surviving Record Identification node.
a. Type Select Best Record in the Name field.
b. Click the down arrow next to the Cluster ID field and select Cluster_ID.
c. Add one record rule.
1) Click Add in the Record rules area.
2) Click the down arrow next to Field and select ID.
3) Click in the Operation field and select Maximum Value.
4) Click Add Condition. The Expression area is updated to the following:
5) Click OK to close the Add Record Rule Expression window.
d. Click OK to close the Surviving Record Identification Properties window.
10. Establish an appropriate description for the Surviving Record Identification node.
a. If necessary, click the Select Best Record node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Surv Rec Id as the value for the Description field.
11. Add a (temporary) Text File Output node to demonstrate the surviving record options.
a. In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Text File Output node. An instance of the node appears in the data flow. The Text File Output Properties window appears.
12. Specify properties for the Text File Output node.
a. Type Test in the Name field.
b. Specify the output file information.
1) Click the browse button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Test.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to " (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
d. Move the Cluster_ID field so that it follows the ID field.
1) Scroll in the Selected list box and locate the Cluster_ID field.
2) Click the up arrow until Cluster_ID follows the ID field.
e. Click OK to save the changes and close the Text File Output Properties window.
13. Establish an appropriate description for the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Text File Output as the value for the Description field.
14. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
15. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job.
The processing information appears on each of the nodes. Notice that the Surviving Record Identification node read 63 rows. For each multi-row cluster, only one record was selected (the record with the maximum ID value). Therefore, 57 rows were written to the text file.
Selecting only one record from each cluster is the default action.
16. Edit the properties of the Surviving Record Identification node.
a. Right-click the Select Best Record node and select Properties.
b. Click Options to the right of the Cluster ID field.
1) Click Keep duplicate records.
2) Type SR_Flag in the Surviving record ID field field.
3) Click OK to close the Options window.
c. Click OK to close the Surviving Record Identification Properties window.
17. Edit the properties of the Text File Output node.
a. Right-click the Text File Output node and select Properties.
b. Click Display file after job runs.
c. Locate the SR_Flag field in the Available list box under the Output fields area.
d. Click the right arrow to move the SR_Flag field to the Selected list box.
e. Click the SR_Flag field in the Selected list box.
f. Click the up arrow to move SR_Flag up in the list box so that it follows the Cluster_ID field.
g. Click OK to close the Text File Output Properties window.
18. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
19. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job. The text file appears.
c. Select File → Exit to close the Notepad window.
The processing information appears on each of the nodes. Notice that the Text File Output node wrote all 63 records. The text file shows that the surviving record in a cluster has true in the SR_Flag field, and the duplicate (non-surviving) records have false in the SR_Flag field.
20. Edit the properties of the Surviving Record Identification node.
a. Right-click the Select Best Record node and select Properties.
b. Click Options to the right of the Cluster ID field.
1) Click Generate distinct surviving record.
2) Click the down arrow next to Primary key field and select ID.
3) Click OK to close the Options window.
c. Click OK to close the Surviving Record Identification Properties window.
21. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
22. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job. The text file appears.
c. Select File → Exit to close the Notepad window.
The processing information appears on each of the nodes. There is now an extra record for each cluster. The text file shows that the surviving record in a cluster has true in the SR_Flag field and the duplicate (non-surviving) records have false in the SR_Flag field.
If a field (such as ID in this example) is not chosen as the Primary key field, then the ID retains its value across all records in a cluster. Thus, an Options specification such as the one shown below produces output such as the following:
23. Edit the properties of the Surviving Record Identification node.
a. Right-click the Select Best Record node and select Properties.
b. Click Options to the right of the Cluster ID field.
1) Click Use primary key as surviving record ID.
2) Click the down arrow next to Primary key field and select ID.
3) Click OK to close the Options window.
c. Click OK to close the Surviving Record Identification Properties window.
24. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
25. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job. The text file appears.
c. Select File → Exit to close the Notepad window.
The processing information appears on each of the nodes.
The text file shows that the surviving record in a cluster has a null value in the SR_Flag field. The duplicate (non-surviving) records have the primary key value of the surviving record in the SR_Flag field.
26. Update the properties for the Surviving Record Identification node.
a. Right-click the Surviving Record Identification node and select Properties.
b. Click Field Rules in the Output fields area on the lower right.
c. Add the first of two field rules.
1) Click Add in the Field Rules window.
2) Click Add in the Rule expressions area of the Add Field Rule window.
3) Click the down arrow next to Field and select EMAIL.
4) Click in the Operation field and select Is Not Null.
5) Click Add Condition. The Expression area is updated to the following:
6) Click OK to close the Add Field Rule Expression window.
7) Click OK to close the Add Field Rule window.
d. Add the second of two field rules.
1) Click Add in the Field Rules window.
2) Click Add in the Rule expressions area of the Add Field Rule window.
3) Click the down arrow next to Field and select JOB TITLE.
4) Click in the Operation field and select Is Not Null.
5) Click Add Condition.
6) Click AND.
7) Click the down arrow next to Field and select JOB TITLE.
8) Click in the Operation field and select Longest value.
9) Click Add Condition.
10) Click OK to close the Add Field Rule Expression window.
11) In the Affected fields area, double-click the NOTES field to move it from the Available list box to the Selected list box.
12) Click OK to close the Add Field Rule window.
13) Click OK to close the Field Rules window.
e. Click OK to close the Surviving Record Identification Properties window.
27. Edit the properties of the Text File Output node.
a. Right-click the Text File Output node and select Properties.
b. Click the EMAIL field in the Selected list box.
c. Click the up arrow to move EMAIL up in the list box so that it follows the SR_Flag field.
d. Click the JOB TITLE field in the Selected list box.
e. Click the up arrow to move JOB TITLE up in the list box so that it follows the EMAIL field.
f. Click OK to close the Text File Output Properties window.
28. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
29. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job. The text file appears.
Without the field rules, the output would be the following:
With the field rules in place, important information that is potentially spread across multiple records of a cluster can be retrieved for the surviving record. In this case, the surviving record with ID=63 has an e-mail address as well as a job title.
30. Right-click the Text File Output node (labeled Test) and select Delete.
31. Select Edit → Clear Run Results → Clear All Run Results.
32. Add an Entity Resolution File Output node to the job flow.
a. If necessary, click the expand icon in front of the Data Outputs grouping of nodes.
b. Double-click the Entity Resolution File Output node. The node appears in the data flow. The Entity Resolution File Output Properties window appears.
33. Specify properties for the Entity Resolution File Output node.
a. Type ER File in the Name field.
b. Click the down arrow next to Cluster ID field and select Cluster_ID.
c. Verify that the Source table field has a value of Customers.
d. Click the browse button next to the Output file field.
1) Type Ch4D8_EntityResolutionResults in the Name field.
2) Navigate to the Basics Demos output_files folder.
3) Click Save.
e. Click Display file after job runs.
f. In the Primary keys area, double-click ID in the Available fields list box to move it to the Selected fields list box.
g. Click Options.
1) Click Display single-record clusters in the output file.
2) Click the down arrow next to Surviving record ID field and select SR_Flag.
3) Click Surviving record ID field contains primary key value.
4) Click OK to close the Options window.
h. Click Edit Field Settings. (You want to select the fields affected by the field rules.)
1) Double-click EMAIL to move it from the Available fields list box to the Selected fields list box.
2) Double-click JOB TITLE to move it from the Available fields list box to the Selected fields list box.
3) Double-click NOTES to move it from the Available fields list box to the Selected fields list box.
4) Click OK to close the Edit Field Settings window.
i. Click Target Settings.
1) Click Flat file.
a) Click the browse button next to the File name field.
(1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
(2) Type Ch4D8_Demo_ERF_Target.txt in the File name field.
(3) Click Save.
b) Verify that Comma is the value for the Delimiter field.
c) Select " (double quotation mark) as the value for the Text qualifier field.
2) Click Flag duplicate records for deletion.
3) Click under the Delete Flag field and select SR_Flag.
4) Click the browse button next to the Audit file name field.
a) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
b) Type Ch4D8_Demo_ERF_Audit.txt in the File name field.
c) Click Save.
5) Click OK to close the Target Settings window.
j. Click OK to close the Entity Resolution File Output Properties window.
34. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
35. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job.
c. Verify that the Ch4D8_EntityResolutionResults tab appears.
36. Review the entity resolution file by using the Entity Resolution Viewer.
a. Verify that the Cluster tab is selected. A list of the clusters contained in the entity resolution file is displayed.
b. If necessary, click the last tool on the toolbar to reveal the Cluster Analysis pane.
c. Move the cursor over the bar where the record count equals 1. Tooltip text appears and states that 53 clusters consist of one record.
d. Move the cursor over the bar where the record count equals 2. Tooltip text appears and states that two clusters consist of two records.
e. Move the cursor over the bar where the record count equals 3. Tooltip text appears and states that two clusters consist of three records.
f. Click the bar with a record count of 3 (three). The Cluster Details pane is suppressed, and the two clusters with a record count of three are highlighted in the cluster listing.
g. Click Cluster 22 in the cluster listing.
h. If necessary, select View → Show Cluster Details Pane.
The top row shows the surviving record based on the rules specified in the Surviving Record Identification node:
• Maximum value of ID.
• EMAIL is not null.
• JOB TITLE is not null AND Longest Value of JOB TITLE.
i. Click the Edit tool (on the Cluster tab). In this Edit mode, the Cluster records area shows the cluster's records, and the Surviving record area shows values that are taken from the surviving record and then overwritten according to the field rule specifications.
1) Double-click the value for the Address field of the Surviving record. This action makes the field editable.
2) Replace the current value of 333 Christian Street with the value 333 Christian St.
3) Press ENTER. Notice that the edited value is bold.
4) Click the save tool. A message window appears.
5) Click Yes. Cluster 22 now appears dimmed and is no longer editable.
j. Type 35 in the Go to cluster field.
k. Click the go button.
1) Double-click the value for the BUSINESS PHONE field of the Surviving record.
2) Replace the current value of (718)555-0100 with the value 718-555-0100.
3) Press ENTER.
4) Click the save tool.
5) Click Yes in the message window. Cluster 35 now appears dimmed and is no longer editable.
You can continue reviewing each of the clusters and making corrections in this way, including for the single-record clusters.
l. Click the Clusters thread to stop editing. The Cluster tab appears. A check mark appears next to the two clusters where changes were applied.
m. Select File → Close Entity Resolution File.
n. Review the audit file that was produced.
1) Select Start → All Programs → Accessories → Windows Explorer to access a Windows Explorer window.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Double-click Ch4D8_Demo_ERF_Audit.txt. A Notepad window appears with the results of the cluster edits.
4) Select File → Exit to close the Notepad window.
o. Review the target file that was produced.
1) Access the Windows Explorer window.
2) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Double-click Ch4D8_Demo_ERF_Target.txt. The information in the target file consists of the non-surviving records from the two clusters that were edited.
4) Select File → Exit to close the Notepad window.
37. Select File → Close.

Exercises

5. Selecting the Best Record
Create a data job that works with the Surviving Record Identification node. The final job flow should resemble the following:
• Copy the data job Ch4E4_Manufacturers_MatchReport located in the batch_jobs folder of the Basics Exercises repository. Paste it in the batch_jobs folder of the Basics Exercises repository. Rename it Ch4E5_Manufacturers_SelectBestRecord.
• Remove the Match Report node and add a Surviving Record Identification node with the following properties:
Name: Select Best Record
Cluster ID field: Cluster_ID
Options: Keep duplicate records. Specify SR_ID in the Surviving record ID field. Use the primary key as the surviving record ID. Specify ID in the Primary key field.
Record rule: Maximum value of POSTDATE
Field rule: Highest occurrence of CONTACT
• Add a Text File Output node with the following properties:
Name: Write Results
Output file: Ch4E5_Manufacturer_Best_Record.txt in the directory S:\Workshop\dqdmp1\Exercises\files\output_files
File attributes: " (double quotation mark) as the text qualifier, Comma as the field delimiter, Include header row, Display file after job runs
Output fields: Cluster_ID, ID, SR_ID, POSTDATE, MANUFACTURER, CONTACT, CONTACT_ADDRESS, CONTACT_CITY, CONTACT_STATE_PROV, CONTACT_POSTAL_CD, CONTACT_CNTRY, CONTACT_PHONE
• Save and run the data job.
• Examine the output text file. Do the results make sense?

Cluster Difference

The Cluster Diff node is used to compare two sets of clustered records by reading in data from a left and a right table (side). From each side, the Cluster Diff node takes two inputs: a numeric record ID field and a cluster number field.
Possible values for diff types include the following:
COMBINE - A record belongs to a set of records from one or more clusters in the left table that are combined into a larger cluster in the right table.
DIVIDE - A record belongs to a set of records in a cluster in the left table that is divided into two or more clusters in the right table.
NETWORK - A record belongs to a set of records that are involved in one or more different multi-record clusters in the left and right tables.
SAME - A record belongs to a set of records that are in the same cluster in both the left and right tables.
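The following sketch shows how such diff types can be derived by comparing two clusterings of the same record IDs. The two assignments are invented, and this simplified set-containment classification only approximates the Cluster Diff node's actual logic.

```python
from collections import defaultdict

left  = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C"}    # record -> left cluster
right = {1: "X", 2: "X", 3: "X", 4: "Y", 5: "C2", 6: "C2"}  # record -> right cluster

def members(assign):
    # Invert a record->cluster map into cluster->set-of-records.
    out = defaultdict(set)
    for rec, clu in assign.items():
        out[clu].add(rec)
    return out

lm, rm = members(left), members(right)

def diff_type(rec):
    l_set, r_set = lm[left[rec]], rm[right[rec]]
    if l_set == r_set:
        return "SAME"
    if l_set < r_set:     # whole left cluster swallowed by a larger right cluster
        return "COMBINE"
    if r_set < l_set:     # right cluster is a piece of the left cluster
        return "DIVIDE"
    return "NETWORK"      # clusters overlap without containment

for rec in sorted(left):
    print(rec, diff_type(rec))
# Records 1 and 2 report COMBINE (their left cluster was merged with record 3),
# record 3 reports NETWORK, record 4 reports DIVIDE, and records 5 and 6
# report SAME (and would be skipped with the "same" diff type option).
```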
Creating a Data Job to Compare Clusters (Optional)

In this demonstration, you investigate the Cluster Diff node and the results that it can produce.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Copy an existing job to use as a starting point.
a. Click the expand icon in front of the Basics Demos repository.
b. Click the expand icon in front of the batch_jobs folder.
c. Right-click the Ch4D7_ClusterRecords_Example job and select Copy.
d. Right-click the batch_jobs folder and select Paste. A message window appears.
e. Click Rename the new item (Copy of …, Copy (2) of …, etc.).
f. Click OK.
5. Rename the copied job.
a. Right-click Copy of Ch4D7_ClusterRecords_Example and select Rename.
b. Type Ch4D9_ClusterDiff_Example.
c. Press ENTER.
6. Double-click Ch4D9_ClusterDiff_Example. The data job opens on a new tab.
7. Remove the last three nodes.
a. Right-click the Standardization node (labeled Standardize Phone) and select Delete.
b. Right-click the Clustering node (labeled Cluster on Three Conditions) and select Delete.
c. Right-click the Match Report node (labeled Customers Match Report) and select Delete.
8. Update the Match Codes node.
a. Right-click the Match Codes node and select Properties.
b. Double-click the EMAIL field to move it from Available to Selected.
c. Select the E-mail match code definition.
d. Verify that the 85 sensitivity is specified.
e. Click OK to close the Match Codes Properties window.
9. Add a Branch node to the job flow.
a. In the Nodes resource pane, click the expand icon in front of the Utilities grouping of nodes.
b. Drag the Branch node to the Data Flow tab. The Branch node appears on the job flow but is not connected to any node.
c. Establish node connections and a description for the Branch node.
1) Click the new Branch node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Branch as the value for the Name field.
5) Type Branch as the value for the Description field.
6) Click the Node Connections tab.
7) In the Connect from area, click the add button. The Add Nodes window appears.
8) Click the Match Codes node (with a name of Match Codes for Various Fields).
9) Click OK to close the Add Nodes window.
10. Add a Clustering node to the job flow.
a. Click the Branch node in the job flow.
b. In the Nodes resource pane, click the collapse icon to collapse the Utilities grouping of nodes.
c. Click the expand icon in front of the Entity Resolution grouping of nodes.
d. Double-click the Clustering node. The node appears in the job flow, connected from the Branch node.
11. Specify properties for the new Clustering node.
a. Type Clustering Using Email in the Name field.
b. Type Cluster_EMAIL in the Output cluster ID field field.
c. Verify that Sort output by cluster number is not selected.
d. Verify that All clusters is selected in the Cluster area.
e. Specify the one condition.
1) Double-click Name_MatchCode to move it from the Available fields list box to the Selected fields list box.
2) Double-click COMPANY_MatchCode to move it from the Available fields list box to the Selected fields list box.
3) Double-click EMAIL_MatchCode to move it from the Available fields list box to the Selected fields list box.
f. Click OK to close the Clustering Properties window.
g. Establish a description for the Clustering node.
1) Click the new Clustering node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Clustering as the value for the Description field.
12. Add a second Clustering node to the job flow.
a. Click the Branch node in the job flow.
b. If necessary, click the expand icon in front of the Entity Resolution grouping of nodes.
c. Double-click the Clustering node. The node appears in the job flow, connected from the Branch node.
13. Specify properties for the new Clustering node.
a. Type Clustering Using Address Information in the Name field.
b. Type Cluster_Address in the Output cluster ID field field.
c. Verify that Sort output by cluster number is not selected.
d. Verify that All clusters is selected in the Cluster area.
e. Specify two conditions.
1) Double-click Name_MatchCode (to move it from Available to Selected).
2) Double-click COMPANY_MatchCode (to move it from Available to Selected).
3) Double-click ADDRESS_MatchCode (to move it from Available to Selected).
4) Double-click STATE/PROVINCE_MatchCode (to move it from Available to Selected).
5) Click OR.
6) Double-click Name_MatchCode (to move it from Available to Selected).
7) Double-click COMPANY_MatchCode (to move it from Available to Selected).
8) Double-click ADDRESS_MatchCode (to move it from Available to Selected).
9) Double-click ZIP/POSTAL CODE_MatchCode (to move it from Available to Selected).
The conditions should resemble the following:
f. Click OK to close the Clustering Properties window.
g. Establish a description for the Clustering node.
1) Click the new Clustering node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Clustering as the value for the Description field.
The job flow should now resemble the following:
14. Add a Cluster Diff node to the job flow.
a. In the Nodes resource pane, verify that the Entity Resolution grouping of nodes is expanded.
b. Right-click the Cluster Diff node and select Insert Node.
c. Position the new Cluster Diff node to the right of and between the two Clustering nodes.
d. Establish node connections and a description for the Cluster Diff node.
1) Click the new Cluster Diff node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Cluster Diff as the value for the Name field.
5) Type Cluster Diff as the value for the Description field.
6) Click the Node Connections tab.
7) In the Connect from area, click Slot 0. Then click the add button. The Add Nodes window appears.
8) Click the Clustering Using Email node.
9) Click OK to close the Add Nodes window.
10) In the Connect from area, click Slot 1. Then click the add button. The Add Nodes window appears.
11) Click the Clustering Using Address Information node.
12) Click OK to close the Add Nodes window.
The connections to the Cluster Diff node should now be defined.
15. Specify properties for the new Cluster Diff node.
15. Specify properties for the new Cluster Diff node.
    a. Right-click the Cluster Diff node and select Properties.
    b. Type Cluster Diff in the Name field.
    c. For the Left table, click the down arrow under Record ID and select ID.
    d. For the Left table, click the down arrow under Cluster number and select Cluster_EMAIL.
    e. For the Right table, click the down arrow under Record ID and select ID.
    f. For the Right table, click the down arrow under Cluster number and select Cluster_Address.
    g. Type Diff_Set in the Diff set field in the Output field area.
    h. Type Diff_Type in the Diff type field in the Output field area.
    i. Verify that Skip rows with "same" diff type is selected.
    j. Click Additional Outputs.
       1) Click the move-all button to move all fields to the Output fields list box.
       2) Click OK to close the Additional Outputs window.
    k. Click OK to close the Cluster Diff Properties window.
16. Add an HTML Report node to the job flow.
    a. Click the Cluster Diff node in the job flow.
    b. In the Nodes resource pane, click – to collapse the Entity Resolution grouping of nodes.
    c. Click + in front of the Data Outputs grouping of nodes.
    d. Double-click the HTML Report node. The node appears in the job flow.
17. Specify properties for the new HTML Report node.
    a. Type HTML Report in the Name field.
    b. Type Customer Cluster Difference Report in the Report title field.
    c. Specify the output file information.
       1) Click the browse button next to the Output file field.
       2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
       3) Type Ch4D9_ClusterDiffReport.html in the File name field.
       4) Click Save.
    d. Click Display report in browser after job run.
    e. Click the move-all button to move all fields to the Selected list box.
    f. If necessary, rearrange the selected fields so that the following fields appear first: Diff_Set, Diff_Type, ID, Cluster_EMAIL, Cluster_Address.
    g. Click OK to close the HTML Report Properties window.
    The final job flow should resemble the following:
18. Save the job.
    a. Click the Data Flow tab.
    b. Select File → Save.
19. Run the job.
    a. Verify that the Data Flow tab is selected.
    b. Select Actions → Run Data Job. The HTML file appears.
20. Close the browser when you are finished viewing the HTML report.
21. Review the log.
22. Select File → Close.

4.5 Multi-Input/Multi-Output Data Jobs (Self-Study)

Objectives
• Explore pre-built jobs.
• Understand how to specify and verify the node connections.
• Review several "new" nodes.

New Products Example
A data job that has two inputs and two outputs is explored. This job involves three new nodes:
• Data Joining
• Branch
• Sequencer (Autonumber)

Customer Matches Example
A data job with two inputs and two outputs is explored. The generated match codes are used to provide conditions for data joining.

Multi-Input/Multi-Output Data Job: New Products

In this demonstration, you examine a data job that takes advantage of two inputs and creates two outputs. The data job uses a Data Joining node, a Branch node, and a Sequencer (Autonumber) node. Each node's properties are explored, as well as the data connection information.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Access an existing job.
   a. Click + in front of the Basics Solutions repository.
      To establish/connect to the Basics Solutions repository (if necessary), do the following:
      1) Click the Administration riser bar.
      2) Right-click Repository Definitions and select New Repository Definition.
      3) Type Basics Solutions as the value for the Name field.
      4) Click Browse next to the Database file field and navigate to S:\Workshop\dqdmp1\solutions\solutions.rps.
      5) Click Browse next to the Folder field (under File storage) and navigate to S:\Workshop\dqdmp1\solutions\files.
      6) Clear Private.
      7) Click OK.
   b. Click the batch_jobs folder.
5. In the main area, double-click the data job named Ch4D10_AddNewProducts.
   The data job diagram should resemble the following:

   The job flow diagram accessed may be "vertical". In the above picture the job diagram is horizontal for display purposes. Also, the job flow diagram accessed will have sticky note objects that are not displayed in the above picture.
6. Review the Data Source node.
   a. Right-click the Data Source node and select Properties.
   b. Verify that the Input table field displays the Products table from the dfConglomerate Gifts data connection.
   c. Verify that all fields from the Products table are selected.
   d. Click OK to close the Data Source Properties window.
7. Review the data in the dfConglomerate Products table.
   a. Click the Home tab.
   b. Click the Data riser bar.
   c. Click + in front of Data Connections.
   d. Click + in front of the dfConglomerate Gifts data connection.
   e. Click the Products table. The main area displays information for the Products table.
   f. If necessary, click the Data tab.
   g. Verify that there are 29 records in the Products table.
8. Review the Text File Input node.
   a. Click the main tab Ch4D10_AddNewProducts.
   b. If necessary, click the Data Flow tab.
   c. Right-click the Text File Input node and select Properties.
   d. Review the input file specifications.
   e. Verify that 5 (five) fields are specified.
   f. Click OK to close the Text File Input Properties window.
9. Preview the data from the text file input.
   a. Verify that the Details panel is shown (if necessary, select View → Show Details Pane).
   b. Right-click the Text File Input node and select Preview.
   c. Verify that there are 14 records in the input text file.
10. Review the connections for the Data Joining node.
    a. Verify that the Details panel is shown (if necessary, select View → Show Details Pane).
    b. Click the Data Joining node.
    c. From the Details pane, click the Node Connections tab.
    d. Verify that the Connect from pane has two slots: one is filled with the Data Source node and one is filled with the Text File Input node.
11. Review the Data Joining node.
    a. Right-click the Data Joining node and select Properties.
    b. Verify that the join type is set to Right.
    c. Verify that the join criteria (expression) is the equality of PRODUCT CODE from the Products data source and PRD_CODE from the text file input.
    d. In the Output fields area, verify that the fields from the Products data source (the left table) have updated output names: each field ends with the text _1.
    e. In the Output fields area, verify that the fields from the text file input (the right table) have updated output names: each field ends with the text _2.
       The Data Joining node enables multiple inputs. The first node connected to this node is always considered the left side, and the second connected node is considered the right side.
    f. Click OK to close the Data Joining Properties window.
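Setting the GUI aside, the Right join plus the downstream Branch and Data Validation nodes amount to a right outer join followed by a null test on a left-side key. A minimal pandas sketch of the same logic, using invented miniature stand-ins for the Products table and the incoming text file:

    import pandas as pd

    products = pd.DataFrame({"ID": [1, 2], "PRODUCT CODE": ["A1", "B2"]})
    incoming = pd.DataFrame({"PRD_CODE": ["B2", "C3"], "DESCRIPTION": ["Mug", "Bowl"]})

    # Mimic the node's renaming: left fields get _1, right fields get _2.
    joined = products.add_suffix("_1").merge(
        incoming.add_suffix("_2"),
        how="right", left_on="PRODUCT CODE_1", right_on="PRD_CODE_2")

    # A right join fills unmatched left-side fields with nulls,
    # so ID_1 distinguishes existing products from new records.
    matches = joined[joined["ID_1"].notna()]    # product already exists
    new_rows = joined[joined["ID_1"].isna()]    # records to insert

That null-filling behavior is exactly why the two Data Validation nodes reviewed next test ID_1 is not null and ID_1 is null.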
12. Review the connections for the Branch node.
    a. Verify that the Details panel is shown (if necessary, select View → Show Details Pane).
    b. Click the Branch node.
    c. From the Details pane, click the Node Connections tab.
    d. Verify that the Connect to pane has two slots: one is filled with a Data Validation node labeled No Match and one is filled with a Data Validation node labeled Match.
13. Review the Branch node.
    a. Right-click the Branch node and select Properties.
       The Branch node enables multiple outputs. It is typically followed by two or more Data Validation nodes to filter records down a particular path. A Branch node can have up to 32 connections. The Memory cache size field specifies the amount of memory (in megabytes) allocated to this step. The Land all data locally before processing continues check box, when selected, ensures that all data from the data sources is placed on the local machine before the job continues to be processed. (Selecting this option can reduce job performance.)
    b. Click OK to close the Branch Properties window.
14. Review the Data Validation node labeled Match.
    a. Right-click the Match Data Validation node and select Properties.
    b. Verify that the expression is ID_1 is not null. Because this is a right join, if a match occurs (that is, if PRODUCT CODE equals PRD_CODE), then the ID_1 field has a value.
    c. Click OK to close the Data Validation Properties window.
15. Review the Data Validation node labeled No Match.
    a. Right-click the No Match Data Validation node and select Properties.
    b. Verify that the expression is ID_1 is null. Because this is a right join, if a match does not occur (that is, if PRODUCT CODE does not equal PRD_CODE), then the ID_1 field does not have a value.
    c. Click OK to close the Data Validation Properties window.
16. Review the Text File Output node labeled Products Matched.
    a. Right-click the Text File Output node and select Properties.
    b. Review the output file specifications.
    c. Verify that all input (Available) fields are selected for output.
    d. Click OK to close the Text File Output Properties window.
17. Review the Sequencer (Autonumber) node labeled Create Unique ID.
    a. Right-click the Sequencer (Autonumber) node and select Properties.
    b. Verify that the field to be created (Field name field) is named PK.
    c. Verify that the Start number field is set to 91.
    d. Verify that the Interval field is set to 1.
    e. Click OK to close the Sequencer (Autonumber) Properties window.
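The Sequencer (Autonumber) node simply stamps each passing row with the next value of an arithmetic sequence. A sketch of that behavior, using the settings just verified (start number 91, interval 1):

    def sequencer(rows, field="PK", start=91, interval=1):
        """Append an auto-incrementing key to each row dict."""
        for n, row in enumerate(rows):
            row[field] = start + n * interval
            yield row

    new_products = [{"PRD_CODE": "C3"}, {"PRD_CODE": "D4"}]
    print(list(sequencer(new_products)))
    # [{'PRD_CODE': 'C3', 'PK': 91}, {'PRD_CODE': 'D4', 'PK': 92}]

Starting at 91 presumably leaves room for the key values already present in the Products table; the course materials do not state this explicitly.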
18. Review the Data Target (Insert) node.
    a. Right-click the Data Target (Insert) node and select Properties.
    b. Verify that the fields from the text file (field names end with _2) are in the Selected list and are renamed to match their corresponding fields in the existing Products table.
    c. Click OK to close the Data Target (Insert) Properties window.
19. If necessary, save the job.
    a. Click the Data Flow tab.
    b. Select File → Save.
20. Run the job.
    a. Verify that the Data Flow tab is selected.
    b. Select Actions → Run Data Job. The CSV file opens in Microsoft Excel.
    c. Select File → Exit to close Microsoft Excel. (If you are prompted, do not save the changes.)
21. Click the Log tab to review the log. Verify that 8 rows were added to the Products table.
22. Click the Data Flow tab.
23. Select File → Close.
24. Review the new table.
    a. Click the Data Connections riser bar.
    b. Click + in front of the dfConglomerate Gifts data connection.
    c. Click the Products table.
    d. If necessary, click the Data tab.
    e. Verify that there are 37 rows in the Products table. (Therefore, eight records were added.)

Multi-Input/Multi-Output Data Job: Customer Matches

In this demonstration, you examine a data job that takes advantage of two inputs and creates two outputs. The data job generates match codes on a variety of fields from each of the sources and then performs a data join using various conditions that involve the match codes. The matches are written to a text file, as are the non-matches.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Access an existing job.
   a. Click + in front of the Basics Solutions repository.
   b. Click the batch_jobs folder.
   c. In the main area, double-click the data job named Ch4D11_CustomerMatches.
   The data job diagram should resemble the following:

   The job flow diagram accessed may be "vertical". In the above picture the job diagram is horizontal for display purposes. Also, the job flow diagram accessed will have sticky note objects that are not displayed in the above picture.
5. Review the Data Source node.
   a. Right-click the Data Source node and select Properties.
   b. Verify that the Input table field displays the Customers table from the dfConglomerate Gifts data connection.
   c. Verify that all fields from the Customers table are selected.
   d. Click OK to close the Data Source Properties window.
6. Review the data in the dfConglomerate Customers table.
   a. Click the Home tab.
   b. Click the Data Connections riser bar.
   c. Click + in front of the dfConglomerate Gifts data connection.
   d. Click the Customers table. The main area displays information for the Customers table.
   e. If necessary, click the Data tab.
   f. Notice that there are 63 records in the Customers table.
7. Review the Text File Input node.
   a. Click the Data Flow tab.
   b. Right-click the Text File Input node and select Properties.
   c. Review the input file specifications.
   d. Verify that 8 (eight) fields are specified.
   e. Click OK to close the Text File Input Properties window.
8. Review the Match Codes (Parsed) node.
   a. Right-click the Match Codes (Parsed) node and select Properties.
   b. Verify that Output field has a value of Name_Matchcode.
   c. Verify that the Name definition is applied at the 85 sensitivity.
   d. Verify that two tokens are used to form the output match code field.
   e. Click OK to close the Match Codes (Parsed) Properties window.
9. Review the Match Codes node that follows the Match Codes (Parsed) node.
   a. Right-click the Match Codes node and select Properties.
   b. Verify that 6 (six) fields have match codes generated using an appropriate definition.
   c. Verify that all fields generate match codes at the 85 sensitivity.
   d. Click OK to close the Match Codes Properties window.
10. Review the Match Codes node that follows the Text File Input node.
    a. Right-click the Match Codes node and select Properties.
    b. Verify that 7 (seven) fields have match codes generated using an appropriate definition.
    c. Verify that all fields generate match codes at the 85 sensitivity.
    d. Click OK to close the Match Codes Properties window.
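Match codes act as fuzzy keys: two values that "mean" the same thing should produce the same code, with the sensitivity controlling how much detail must agree. The real QKB definitions involve parsing, vocabularies, and phonetics; the toy sketch below only gestures at the idea and is in no way the actual algorithm:

    import re
    import unicodedata

    def toy_match_code(value, sensitivity=85):
        """A deliberately crude stand-in for match-code generation."""
        if value is None or not value.strip():
            return None        # mirror 'generate null match codes for blank field values'
        # Strip accents, uppercase, and drop everything but letters and digits.
        s = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
        s = re.sub(r"[^A-Z0-9]", "", s.upper())
        # Loosely imitate sensitivity: a higher value keeps more of the string,
        # so more detail must agree before two codes collide.
        keep = max(1, round(len(s) * sensitivity / 100))
        return s[:keep]

    print(toy_match_code("Smith, John") == toy_match_code("SMITH JOHN"))   # True

Joining or clustering on codes like these, rather than on the raw values, is what lets the job pair records that differ in punctuation, casing, or minor spelling.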
11. Review the Sequencer (Autonumber) node.
    a. Right-click the Sequencer (Autonumber) node and select Properties.
    b. Verify that the field to be created (Field name field) is named ID.
    c. Verify that the Start number field is set to 1.
    d. Verify that the Interval field is set to 1.
    e. Click OK to close the Sequencer (Autonumber) Properties window.
12. Review the Data Joining node.
    a. Right-click the Data Joining node and select Properties.
    b. Verify that the Join type is set to Right.
    c. Verify that the join criteria involves one of four different conditions.
    d. In the Output fields area, verify that the fields from the Customers data source (the left table) have updated Output Names. That is, each field ends with the text _1.
    e. In the Output fields area, verify that the fields from the text file input (the right table) have updated Output Names. That is, each field ends with the text _2.
    f. Click OK to close the Data Joining Properties window.
13. Review the Data Validation node labeled Text File - Record Matches DB Record.
    a. Right-click the Data Validation node and select Properties.
    b. Verify that the expression is ID_1 is not null. Because this is a right join, if a match occurs (that is, if at least one of the four conditions specified in the data joining is met), then the ID_1 field has a value.
    c. Click OK to close the Data Validation Properties window.
14. Review the Data Validation node labeled Text File - Record Does Not Match DB Record.
    a. Right-click this Data Validation node and select Properties.
    b. Verify that the expression is ID_1 is null. Because this is a right join, if a match does not occur (that is, if none of the four conditions specified in the data joining is met), then the ID_1 field does not have a value.
    c. Click OK to close the Data Validation Properties window.
15. Review the Text File Output node labeled Matches.
    a. Right-click the Text File Output node and select Properties.
    b. Review the output file specifications.
    c. Verify that all input (Available) fields are selected for output.
    d. Click OK to close the Text File Output Properties window.
16. Review the Text File Output node labeled Non-Matches.
    a. Right-click the Text File Output node and select Properties.
    b. Review the output file specifications.
    c. Verify that only the input (Available) fields with names ending in _2 are selected for output.
    d. Click OK to close the Text File Output Properties window.
17. If necessary, save the job.
    a. Click the Data Flow tab.
    b. Select File → Save.
18. Run the job.
    a. Verify that the Data Flow tab is selected.
    b. Select Actions → Run Data Job. The two CSV files open in Microsoft Excel.
    c. Select File → Exit to close each of the Microsoft Excel sessions. (If you are prompted, do not save the changes.)
19. Click the Log tab to review the log.
20. Select File → Close.
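The join in this job is fuzzy in two ways: the keys are match codes rather than raw values, and any one of four AND-ed condition sets is enough to pair a text-file record with a database record. A sketch of that OR-of-ANDs predicate follows; the condition sets shown are hypothetical placeholders, since the demonstration does not enumerate them (the exercise after this demonstration lists concrete ones):

    def records_match(db_row, file_row, condition_sets):
        """condition_sets: list of lists of (db_field, file_field) pairs.
        True when every pair in at least one condition set has equal,
        non-null match codes."""
        return any(
            all(db_row.get(a) is not None and db_row.get(a) == file_row.get(b)
                for a, b in cond)
            for cond in condition_sets)

    # Hypothetical shape of the condition sets:
    condition_sets = [
        [("NAME_MatchCode", "NAME_MatchCode"),
         ("ADDRESS_MatchCode", "ADDRESS_MatchCode")],
        # ...three more sets in the actual job
    ]

Because the predicate is evaluated per record pair, a single text-file record can match a database record through any of the condition sets, and only records matching none of them flow down the Non-Matches path.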
Exercises

6. Creating a Multi-Input/Multi-Output Data Job

In this exercise, you create a final job flow that should resemble the following:

• Create a new data job named Ch4E6_MultiInputOutput in the batch_jobs folder of the Basics Exercises repository.
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data source. Select the following fields:
  ID
  MANUFACTURER
  CONTACT
  CONTACT_ADDRESS
  CONTACT_CITY
  CONTACT_STATE_PROV
  CONTACT_POSTAL_CD
  CONTACT_CNTRY
  CONTACT_PHONE
  POSTDATE
• Standardize the following fields with the corresponding definitions:

  Field Name             Definition
  MANUFACTURER           Organization
  CONTACT                Name
  CONTACT_ADDRESS        Address
  CONTACT_STATE_PROV     State/Province(Abbreviation)
  CONTACT_PHONE          Phone

• Generate match codes for the following fields. Use the specified definitions and sensitivities. Accept the default names for the match code fields. Be sure to generate null match codes for blank field values and to preserve null values.

  Field Name                 Definition       Sensitivity
  MANUFACTURER_Stnd          Organization     75
  CONTACT_Stnd               Name             75
  CONTACT_ADDRESS_Stnd       Address          85
  CONTACT_STATE_PROV_Stnd    State/Province   85
  CONTACT_POSTAL_CD          Postal Code      85

• Also add a text file as input. The text file uses " (double quotation mark) to qualify data, is comma-delimited, and has a header row.
  Text file: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.txt
  DFL file: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.dfl
• Add a Standardization node following the Text File Input node and standardize as follows:

  Field Name      Definition
  COMPANY         Organization
  NAME            Name
  WORK_ADDRESS    Address
  WORK_STATE      State/Province(Abbreviation)
  WORK_PHONE      Phone

• Generate match codes for the following fields. Use the specified definitions and sensitivities. Accept the default names for the match code fields. Be sure to generate null match codes for blank field values and to preserve null values.

  Field Name           Definition       Sensitivity
  COMPANY_Stnd         Organization     75
  NAME_Stnd            Name             75
  WORK_ADDRESS_Stnd    Address          85
  WORK_STATE_Stnd      State/Province   85
  WORK_ZIP             Postal Code      85

• Create a unique identifier field in the data flowing from the text file. Name the field ID, start the values at 1 (one), and increment by 1 (one).
• Add a Data Joining node to the job flow to join the data flowing from the data source with the data flowing from the text file. The join should be a right join, and the join criteria can be one of the following:
  MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
  CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
  CONTACT_ADDRESS_Stnd_MatchCode = WORK_ADDRESS_Stnd_MatchCode
  CONTACT_STATE_PROV_Stnd_MatchCode = WORK_STATE_Stnd_MatchCode
  or
  MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
  CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
  CONTACT_ADDRESS_Stnd_MatchCode = WORK_ADDRESS_Stnd_MatchCode
  CONTACT_POSTAL_CD_MatchCode = WORK_ZIP_MatchCode
  or
  MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
  CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
  CONTACT_PHONE_Stnd = WORK_PHONE_Stnd
  CONTACT_POSTAL_CD_MatchCode = WORK_ZIP_MatchCode
• Add a Branch node following the Data Joining node.
• Add two Data Validation nodes. One searches for records where ID_1 is not null (matches), and the other searches for records where ID_1 is null (non-matches).
• Add two Text File Output nodes: one to "hold" the matches and one for the non-matches. Create each file to use " (double quotation mark) to qualify data. Each file should be comma-delimited and include a header row.
  Add the following fields to the Matches text file:

  Field Name        Output Name
  ID_1              DB_ID
  MANUFACTURER_1    MANUFACTURER
  CONTACT_1         CONTACT
  ID_2              TEXT_ID
  COMPANY_2         COMPANY
  NAME_2            NAME

  Add the following fields to the Non-matches text file:

  Field Name        Output Name
  COMPANY_2         COMPANY
  NAME_2            NAME
  WORK_ADDRESS_2    WORK_ADDRESS
  WORK_CITY_2       WORK_CITY
  WORK_STATE_2      WORK_STATE
  WORK_ZIP_2        WORK_ZIP
  WORK_PHONE_2      WORK_PHONE

• Save and run the job.

QUESTION: How many records were written to the Matches text file?
Answer: QUESTION: How many records were written to the Non-Matches text file? Answer: 4-170 Chapter 4 ACT 4.6 Solutions to Exercises 1. Create a Simple Data Job a. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. b. Verify that the Home tab is selected. c. Click the Folders riser bar. d. Click the Basics Exercises repository. e. Click Data Job. 1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.) 2) Type Ch4E1_Breakfast_Items_OZ in the Name field. 3) Click OK. The new data job appears on a tab. f. Add the Data Source node to the Data Flow Editor. 1) Verify that the Nodes riser bar is selected in the Resource pane. 2) Click in front of the Data Inputs grouping of nodes. 3) Double-click the Data Source node. 4) Type Breakfast Items Table in the Name field. 5) Click a) Click next to the Input table field. in front of the dfConglomerate Grocery data source. b) Click BREAKFAST_ITEMS. c) Click OK to close the Select Table window. 6) Click OK to save the changes and close the Data Source Properties window. The Data Source node appears in the job diagram with updated information displayed. g. Add a Data Validation node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Data Inputs grouping of nodes. in front of the Utilities grouping of nodes. 3) Double-click the Data Validation node. h. Specify properties for the Data Validation node. 1) Type Filter for OZ UOM in the Name field. 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 2) Click next to Field and select UOM. 3) Click in the Operation field and select equal to. 4) Click Single (if necessary) and then type OZ. 5) Click Add Condition. The Expression area is updated. The final settings for the Data Validation node should resemble the following: 6) Click OK to close the Data Validation Properties window. i. Preview the Data Validation node. 1) If necessary, click the Data Validation node in the job diagram. 2) Verify that the Details pane is displayed. 3) Click the Preview tool ( ). 4-171 4-172 Chapter 4 ACT 4) Scroll to the right to view the UOM field. j. Add a Text File Output node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Utilities grouping of nodes. in front of the Data Outputs grouping of nodes. 3) Double-click the Text File Output node. k. Specify properties for the Text File Output node. 1) Type Breakfast Items - OZ in the Name field. 2) Specify the output file information. a) Click next to the Output file field. b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files. c) Type Ch4E1_Breakfast_Items_OZ.txt in the File name field. d) Click Save. 3) Specify attributes for the file. a) Verify that the Text qualifier field is set to “ (double quotation mark). b) Verify that the Field delimiter field is set to Comma. c) Click Include header row. d) Click Display file after job runs. 4) Re-order selected fields for the output file. a) Click MANUFACTURER_ID in the Selected area. b) Click until MANUFACTURER_ID appears before BRAND (and after ID). 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 4-173 5) Click OK to save the changes and close the Text File Output Properties window. l. Add a note to the data flow. 1) Click the Insert Note tool ( ). 2) Click and drag the note to position it under the nodes. 3) Type the following as text for the note: This job: 1. Reads the records from the BreakfastItems table 2. Filters for the records where UOM=OZ 3. 
Writes the results to a text file The data job should now resemble the following: m. Specify properties for the job. 1) Click the Settings tab. 2) Type Create and run a data job in the Description field. 3) Type The job writes Breakfast Items with the Unit of Measure of OZ to a text file in the Notes field. n. Select File Save. o. Run the job. 4-174 Chapter 4 ACT 1) Verify that the Data Flow tab is selected. 2) Select Actions Run Data Job. The job runs and the requested text file output is displayed in a Notepad window. 3) Select File Exit to close the Notepad window. 4) Notice the processing information on each node. p. View the log information. 1) Click the Log tab. 2) Review the information for each of the nodes. q. Select File Close. QUESTION: How many records were selected in the filter and therefore written to the text file? Answer: 1615 rows QUESTION: How many records were read from the source table? Answer: 2617 rows 2. Creating a Data Job that Involves Standardization and Parsing a. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. b. Create a new data job. 1) Verify that the Home tab is selected. 2) Click the Folders riser bar. 3) Click the Basics Exercises repository. 4) Click Data Job. 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 4-175 5) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.) 6) Type Ch4E2_Manufacturers_DataQuality in the Name field. 7) Click OK. c. Add the Data Source node to the Data Flow Editor. 1) Verify that the Nodes riser bar is selected in the Resource pane. 2) Click in front of the Data Inputs grouping of nodes. 3) Double-click the Data Source node. d. Specify the properties of the Data Source node. 1) Type Manufacturers Table in the Name field. 2) Click 3) Click next to the Input table field. in front of the dfConglomerate Grocery data source. 4) Click MANUFACTURERS. 5) Click OK to close the Select Table window. 6) In the Output fields area, select only the following fields: ID MANUFACTURER CONTACT CONTACT_ADDRESS CONTACT_CITY CONTACT_STATE_PROV CONTACT_POSTAL_CD CONTACT_CNTRY CONTACT_PHONE POSTDATE 7) Click OK to save the changes and close the Data Source Properties window. e. Add a Standardization node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Data Inputs grouping of nodes. in front of the Quality grouping of nodes. 3) Double-click the Standardization node. The node appears in the data flow. The Standardization Properties window appears. f. Specify properties for the Standardization node. 4-176 Chapter 4 ACT 1) Type Standardize Fields in the Name field. 2) Double-click each of the following fields to move them from the Available list box to the Selected list box. MANUFACTURER CONTACT CONTACT_ADDRESS CONTACT_STATE_PROV CONTACT_CNTRY 3) For the MANUFACTURER field, click Ch3D8 Company Phrase Scheme. 4) For the CONTACT field, click under Scheme and select under Definition and select Name. 5) For the CONTACT_ADDRESS field, click under Definition and select Address. 6) For the CONTACT_STATE_PROV field, click State/Province(Abbreviation). 7) For the CONTACT_CNTRY field, click Ch4E6 CONTACT_CNTRY Scheme. under Definition and select under Scheme and select 8) Click Preserve null values. 9) Click OK to close the Standardization Properties window. g. Preview the Standardization node. 1) If necessary, click the Standardization node in the job diagram. 2) Verify that the Details pane is displayed. 3) Click the Preview tool ( ). 
A sample of records appears on the Preview tab of the Details panel. 4) Scroll to the right to view the _Stnd fields. h. Add a Parsing node to the job flow. 1) In the Nodes resource pane, verify that the Quality grouping of nodes is expanded. 2) Double-click the Parsing node. The node appears in the data flow. The Parsing Properties window appears. i. Specify properties for the Parsing node. 1) Type Parse Contact_Stnd in the Name field. 2) Click under Field to parse and select CONTACT_Stnd. 3) Click under Definition and select Name. 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 4-177 4) Double-click each of the following tokens to move them from the Available list box to the Selected list box. Given Name Middle Name Family Name 5) Change the output name for the Given Name token. a) Double-click the default output name for the Given Name token. b) Type FIRST_NAME. c) Press ENTER. 6) Change the output name for the Middle Name token. a) Double-click the default output name for the Middle Name token. b) Type MIDDLE_NAME. c) Press ENTER. 7) Change the output name for the Family Name token. a) Double-click the default output name for the Family Name token. b) Type LAST_NAME. c) Press ENTER. 8) Click Preserve null values. 4-178 Chapter 4 ACT The settings in the Parsing Properties window should resemble the following: 9) Click OK to close the Parsing Properties window. 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 4-179 j. Preview the Parsing node. 1) If necessary, click the Parsing node in the job diagram. 2) Verify that the Details pane is displayed. 3) Click the Preview tool ( ). A sample of the records appears on the Preview tab of the Details panel. 4) Scroll to the right to view the new, parsed name fields. k. Add a Data Target (Insert) node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Quality grouping of nodes. in front of the Data Outputs grouping of nodes. 3) Double-click the Data Target (Insert) node. The node appears in the data flow. The Parsing Properties window appears. l. Specify properties for the Data Target (Insert) node. 1) Type Insert Records on Manufacturers_Stnd Table in the Name field. 2) Click a) Click next to the Output table field. The Select Table window appears. in front of the dfConglomerate Grocery data source. b) Click dfConglomerate Grocery. c) Click to be able to add a table to the selected data connection. (1) Type MANUFACTURERS_STND in the Enter a name for the new table field. (2) Click OK to close the New Table window. 4-180 Chapter 4 ACT d) Click OK to close the Select Table window. 3) Click Delete existing rows. 4) Reorder and rename the selected output fields. a) Remove all fields from the Selected list box: Each field can be removed by clicking the field in the Selected list box and then clicking . Extended selection is also an option. b) Move the following fields to the Selected side, in the following order: ID MANUFACTURER_Stnd FIRST_NAME MIDDLE_NAME LAST_NAME CONTACT_ADDRESS_Stnd CONTACT_CITY CONTACT_STATE_PROV_Stnd CONTACT_POSTAL_CD CONTACT_CNTRY_Stnd CONTACT_PHONE POSTDATE 5) Rename the _Stnd fields. 
Field Name Output Name ID ID MANUFACTURER_Stnd MANUFACTURER FIRST_NAME FIRST_NAME MIDDLE_NAME MIDDLE_NAME LAST_NAME LAST_NAME CONTACT_ADDRESS_Stnd CONTACT_ADDRESS CONTACT_CITY CONTACT_CITY CONTACT_STATE_PROV_Stnd CONTACT_STATE_PROV CONTACT_POSTAL_CD CONTACT_POSTAL_CD CONTACT_CNTRY_Stnd CONTACT_CNTRY CONTACT_PHONE CONTACT_PHONE POSTDATE POSTDATE 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 4-181 6) Click OK to save the changes and close the Data Target (Insert) Properties window. m. Save the job. 1) Click the Data Flow tab. 2) Select File Save. n. Run the job. 1) Verify that the Data Flow tab is selected. 2) Select Actions Run Data Job. 3) Notice the processing information on each node. o. View the log information. 1) Click the Log tab. 2) Review the information for each of the nodes. p. Select File Close. q. View the newly created data. 1) If necessary, click the Home tab. 2) Click the Data riser bar. 3) Click in front of Data Connections. 4) Click in front of the dfConglomerate Grocery data connection. 5) Click the MANUFACTURERS_STND table object. 6) In the information area, click the Data tab. 7) Scroll through the data. 3. Performing Address Verification on the MANUFACTURERS Table a. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. b. Verify that the Home tab is selected. c. Click the Folders riser bar. d. Click the Basics Exercises repository. e. Click Data Job. 1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.) 2) Type Ch4E3_MANUFACTURERS_Verify in the Name field. 3) Click OK. 4-182 Chapter 4 ACT f. Add the Data Source node to the Data Flow Editor. 1) Verify that the Nodes riser bar is selected in the Resource pane. 2) Click in front of the Data Inputs grouping of nodes. 3) Double-click the Data Source node. 4) Type Manufacturers Table in the Name field. 5) Click a) Click next to the Input table field. in front of the dfConglomerate Grocery data source. b) Click MANUFACTURERS. c) Click OK to close the Select Table window. 6) Click OK to save changes and close the Data Source Properties window. g. Add an Address Verification (US/Canada) node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Data Inputs grouping of nodes. in front of the Enrichment grouping of nodes. 3) Double-click the Address Verification (US/Canada) node. h. Specify properties for the Address Verification (US/Canada) node. 1) Type Verify US Addresses in the Name field. 2) Verify that United States is selected in the Address area. 3) Specify the Input information. a) Click under Field Type for the MANUFACTURER field and click Firm. b) Click under Field Type for the STREET_ADDR field and click Address Line 1. c) Click under Field Type for the CITY field and click City. d) Click under Field Type for the STATE_PROV field and click State. e) Click under Field Type for the POSTAL_CD field and click Zip. 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 4-183 4) Specify options for the address verification. a) Click Options. b) Click Proper case results. c) Click Output blanks as nulls. d) Click Street abbreviation. e) Click City abbreviation. f) Clear CASS compliance. g) Verify that Insert dash between ZIP and ZIP4 is selected. h) Click OK to save the settings and close the Options window. 5) Specify the Output fields’ information. a) Double-click Firm to move this field from the Available list box to the Selected list box. 
b) Double-click Address Line 1 to move this field from the Available list box to the Selected list box. c) Double-click City to move this field from the Available list box to the Selected list box. d) Double-click State to move this field from the Available list box to the Selected list box. e) Double-click ZIP/Postal_Code to move this field from the Available list box to the Selected list box. f) Double-click US_County_Name to move this field from the Available list box to the Selected list box. g) Double-click US_Result_Code to move this field from the Available list box to the Selected list box. h) For the Output Name field, type _Verified after Firm. i) For the Output Name field of Address_Line_1, type Address_Verified. j) For the Output Name field, type _Verified after City. k) For the Output Name field, type _Verified after State. l) For the Output Name field of ZIP/Postal Code, type Zip_Verified. 6) Limit the original fields in output. a) Click Additional Outputs. b) Click the CONTACT field in the Output fields area, and then click . c) Click the CONTACT_ADDRESS field in the Output fields area, and then click d) Click the CONTACT_CITY field in the Output fields area, and then click . . 4-184 Chapter 4 ACT e) Click the CONTACT_STATE_PROV field in the Output fields area, and then click f) Click the CONTACT_POSTAL_CD field in the Output fields area, and then click g) Click the CONTACT_CNTRY field in the Output fields area, and then click . h) Click the CONTACT_PHONE field in the Output fields area, and then click . i) Click the NOTES field in the Output fields area, and then click . j) Click the POSTDATE field in the Output fields area, and then click . k) Click OK to close the Additional Outputs window. 7) Click OK to close the Address Verification (US/Canada) Properties window. i. Preview the Address Verification node. 1) Right-click the Address Verification node and select Preview. 2) Scroll to view the results. j. Add a Text File Output node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Enrichment grouping of nodes. in front of the Data Outputs grouping of nodes. 3) Double-click the Text File Output node. k. Specify properties for the Text File Output node. 1) Type Manufacturers Info – Verify in the Name field. 2) Specify the output file information. a) Click next to the Output file field. b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files. c) Type Ch4E3_Manufacturer_Verify.txt in the File name field. d) Click Save. 3) Specify attributes for the file. a) Verify that the Text qualifier field is set to “ (double quotation mark). b) Verify that Field delimiter field is set to Comma. c) Click Include header row. d) Click Display file after job runs. 4) Specify desired fields for the output file. a) Click to move all default selected fields from the list box. . . 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) b) Specify the following fields with the specified output name: Field Name Output Name ID ID Firm_Verified Manufacturer Address_Verified Address City_Verified City State_Verified State ZIP_Verified ZIP US_County_Name US_County_Name US_Result_Code US_Result_Code 5) Export the field layout. a) Click Export. b) If necessary, navigate to S:\Workshop\dqdmp1\Exercises\files\output_files. c) Type Ch4E3_Manufacturer_Verify.dfl in the File name field. d) Click Save. 6) Click OK to save the changes and close the Text File Output Properties window. l. Save the job. 1) Click the Data Flow tab. 2) Select File Save. m. Run the job. 
1) Verify that the Data Flow tab is selected. 2) Select Actions Run Data Job. n. Verify that the text file appears. o. Select File Exit to close the Notepad window. 4-185 4-186 Chapter 4 ACT Notice the processing information on each node. p. View the log information. 1) Click the Log tab. 2) Review the information for each of the nodes. q. Select File Close. 4. Creating a Data Job that Generates a Match Report a. If necessary, select Start All Programs DataFlux Data Management Studio 2.2. b. Create a new data job. 1) Verify that the Home tab is selected. 2) Click the Folders riser bar. 3) Click the Basics Exercises repository. Data Job. 4) Click 5) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.) 6) Type Ch4E4_Manufacturers_MatchReport in the Name field. 7) Click OK. c. Add the Data Source node to the Data Flow Editor. 1) Verify that the Nodes riser bar is selected in the Resource pane. 2) Click in front of the Data Inputs grouping of nodes. 3) Double-click the Data Source node. d. Specify the properties of the Data Source node. 1) Type Manufacturers Table in the Name field. 2) Click next to the Input table field. 3) Click in front of the dfConglomerate Grocery data source. 4) Click MANUFACTURERS. 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 4-187 5) Click OK to close the Select Table window. 6) In the Output fields area, select only the following fields: ID MANUFACTURER CONTACT CONTACT_ADDRESS CONTACT_CITY CONTACT_STATE_PROV CONTACT_POSTAL_CD CONTACT_CNTRY CONTACT_PHONE POSTDATE 7) Click OK to save the changes and close the Data Source Properties window. e. Add a Match Codes node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Data Inputs grouping of nodes. in front of the Entity Resolution grouping of nodes. 3) Double-click the Match Codes node. The node appears in the data flow. The Match Codes Properties window appears. f. Specify properties for the Match Codes node. 1) Type Match Codes for Various Fields in the Name field. 2) Double-click the MANUFACTURER field to move it from the Available list box to the Selected list box. 3) Click under Definition and select Organization. 4) Accept the default Sensitivity of 75. 5) Double-click the CONTACT field to move it from the Available list box to the Selected list box. 6) Click under Definition and select Name. 7) Accept the default Sensitivity of 75. 8) Double-click the CONTACT_ADDRESS field to move it from the Available list box to the Selected list box. 9) Click under Definition and select Address. 10) Accept the default Sensitivity of 85. 11) Double-click the CONTACT_STATE_PROV field to move it from the Available list box to the Selected list box. 4-188 Chapter 4 ACT 12) Click under Definition and select State/Province. 13) Accept the default Sensitivity of 85. 14) Double-click the CONTACT_POSTAL_CD field to move it from the Available list box to the Selected list box. 15) Click under Definition and select Postal Code. 16) Accept the default Sensitivity of 85. 17) Double-click the CONTACT_PHONE field to move it from the Available list box to the Selected list box. 18) Click under Definition and select Phone. 19) Accept the default Sensitivity of 95. 20) Click Generate null match codes for blank field values. 21) Click Preserve null values. 22) Click OK to close the Match Codes Properties window. g. Preview the Match Codes node. 1) If necessary, click the Match Codes node in the job diagram. 2) Verify that the Details pane is displayed. 
3) Click the Preview tool ( ). A sample of the records appears on the Preview tab of the Details panel.
4) Scroll to the right to view the _MatchCode fields.
h. Add a Clustering node to the job flow.
   1) Verify that the Entity Resolution grouping of nodes is expanded.
   2) Double-click the Clustering node. The node appears in the data flow. The Clustering Properties window appears.
i. Specify properties for the Clustering node.
   1) Type Cluster on Two Conditions in the Name field.
   2) Type Cluster_ID in the Output cluster ID field field.
   3) Click Sort output by cluster number.
   4) Verify that All clusters is selected in the Cluster area.
   5) Specify the first condition.
      a) Double-click MANUFACTURER_MatchCode to move it from the Available fields list box to the Selected fields list box.
      b) Double-click CONTACT_MatchCode to move it from the Available fields list box to the Selected fields list box.
      c) Double-click CONTACT_ADDRESS_MatchCode to move it from the Available fields list box to the Selected fields list box.
      d) Double-click CONTACT_POSTAL_CD_MatchCode to move it from the Available fields list box to the Selected fields list box.
   6) Specify the second condition.
      a) Click OR.
      b) Double-click MANUFACTURER_MatchCode to move it from the Available fields list box to the Selected fields list box.
      c) Double-click CONTACT_MatchCode to move it from the Available fields list box to the Selected fields list box.
      d) Double-click CONTACT_STATE_PROV_MatchCode to move it from the Available fields list box to the Selected fields list box.
      e) Double-click CONTACT_PHONE_MatchCode to move it from the Available fields list box to the Selected fields list box.
   7) Click OK to close the Clustering Properties window.
j. Preview the Clustering node.
   1) If necessary, click the Clustering node in the job diagram.
   2) Verify that the Details pane is displayed.
   3) Click the Preview tool ( ). A sample of records appears on the Preview tab of the Details panel.
   4) Scroll to the right to view the Cluster_ID field.
   5) Scroll down through the observations. Some records have duplicate values for Cluster_ID.
k. Add a Match Report node to the job flow.
   1) In the Nodes resource pane, click – to collapse the Entity Resolution grouping of nodes.
   2) Click + in front of the Data Outputs grouping of nodes.
   3) Double-click the Match Report node. The node appears in the data flow. The Match Report Properties window appears.
l. Specify properties for the Match Report node.
   1) Type Manufacturers Match Report in the Name field.
   2) Click the browse button next to the Report file field.
      a) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
      b) Type Ch4E4_Manufacturers_MatchReport in the File name field.
      c) Click Save.
   3) Type Manufacturers Match Report – Two Conditions in the Report Title field.
   4) Click Launch Viewer after job is completed.
   5) Click the down arrow next to the Cluster field and select Cluster_ID.
   6) Specify the report fields.
      a) Click the move-all button to move all fields from the Selected list box.
      b) Double-click Cluster_ID to move this field from the Available list box to the Selected list box.
      c) Double-click ID to move this field from the Available list box to the Selected list box.
      d) Double-click MANUFACTURER to move this field from the Available list box to the Selected list box.
      e) Double-click CONTACT to move this field from the Available list box to the Selected list box.
      f) Double-click CONTACT_ADDRESS to move this field from the Available list box to the Selected list box.
      g) Double-click CONTACT_CITY to move this field from the Available list box to the Selected list box.
      h) Double-click CONTACT_STATE_PROV to move this field from the Available list box to the Selected list box.
      i) Double-click CONTACT_POSTAL_CD to move this field from the Available list box to the Selected list box.
      j) Double-click CONTACT_CNTRY to move this field from the Available list box to the Selected list box.
      k) Double-click CONTACT_PHONE to move this field from the Available list box to the Selected list box.
   7) Click OK to close the Match Report Properties window.
m. Save the job.
   1) Click the Data Flow tab.
   2) Select File → Save.
n. Run the job.
   1) Verify that the Data Flow tab is selected.
   2) Select Actions → Run Data Job.
   3) Verify that the dfReport Viewer appears.
   4) Select File → Exit to close the dfReport Viewer window. Notice the processing information on each node.
o. View the log information.
   1) Click the Log tab.
   2) Review the information for each of the nodes.
p. Click the Data Flow tab.
q. Select File → Close.

5. Selecting the Best Record

a. Create a copy of the Ch4E4_Manufacturers_MatchReport job.
   1) Verify that the Home tab is selected.
   2) Click the Folders riser bar.
   3) Click + in front of the Basics Exercises repository.
   4) Click + in front of the batch_jobs folder.
   5) Right-click the Ch4E4_Manufacturers_MatchReport job and select Copy.
   6) Right-click the batch_jobs folder and select Paste. A message window appears.
      a) Click Rename the new item (Copy of …, Copy (2) of …, etc.).
      b) Click OK.
   7) Rename the copied job.
      a) Right-click Copy of Ch4E4_Manufacturers_MatchReport and select Rename.
      b) Type Ch4E5_Manufacturers_SelectBestRecord.
      c) Press ENTER.
b. Double-click Ch4E5_Manufacturers_SelectBestRecord to open the job for editing.
c. Right-click the Match Report node (labeled Manufacturers Match Report) and select Delete.
d. Add a Surviving Record Identification node to the job flow.
   1) In the Nodes resource pane, click + in front of the Entity Resolution grouping of nodes.
   2) Double-click the Surviving Record Identification node. The node in the data flow and the Surviving Record Identification Properties window appear.
e. Specify properties for the Surviving Record Identification node.
   1) Type Select Best Record in the Name field.
   2) Click the down arrow next to the Cluster ID field and select Cluster_ID.
   3) Click Options to the right of the Cluster ID field.
      a) Click Keep duplicate records.
      b) Type SR_ID in the Surviving record ID field field.
      c) Click Use primary key as surviving record ID.
      d) Click the down arrow next to the Primary key field and select ID.
      e) Click OK to close the Options window.
   4) Add one record rule.
      a) Click Add in the Record rules area.
      b) Click the down arrow next to Field and select POSTDATE.
      c) Click in the Operation field and select Maximum Value.
      d) Click Add Condition. The Expression area is updated.
      e) Click OK to close the Add Record Rule Expression window.
   5) Click Field Rules in the Output fields area (lower right).
      a) Click Add in the Field Rules window.
      b) Click Add in the Rule expressions area of the Add Field Rule window.
      c) Click the down arrow next to Field and select CONTACT.
      d) Click in the Operation field and select Highest Occurrence.
      e) Click Add Condition.
      f) Click OK to close the Add Field Rule Expression window.
      g) Click OK to close the Add Field Rule window.
   6) Click OK to close the Surviving Record Identification Properties window.
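In plain terms, the node just configured picks one survivor per cluster (the record with the latest POSTDATE), while the field rule overrides CONTACT with the value that occurs most often in the cluster. A rough sketch of those two rules, assuming each cluster is a list of row dicts; this is an interpretation for illustration, not the node's code, and it ignores tie-breaking:

    from collections import Counter

    def survivor(cluster_rows):
        # Record rule: the row with the maximum POSTDATE survives.
        best = max(cluster_rows, key=lambda r: r["POSTDATE"])
        # Field rule: CONTACT is taken from the most frequent non-null value.
        contacts = Counter(r["CONTACT"] for r in cluster_rows if r["CONTACT"])
        if contacts:
            best = dict(best, CONTACT=contacts.most_common(1)[0][0])
        return best

This also anticipates the "Why?" questions at the end of this exercise: within each highlighted cluster, the selected record is simply the one with the latest POSTDATE.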
f. Add a Text File Output node.
   1) In the Nodes resource pane, click – to collapse the Entity Resolution grouping of nodes.
   2) Click + in front of the Data Outputs grouping of nodes.
   3) Double-click the Text File Output node.
g. Edit the properties of the Text File Output node.
   1) Type Write Results in the Name field.
   2) Specify the output file information.
      a) Click the browse button next to the Output file field.
      b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
      c) Type Ch4E5_Manufacturer_Best_Record.txt in the File name field.
      d) Click Save.
   3) Specify attributes for the file.
      a) Verify that the Text qualifier field is set to " (double quotation mark).
      b) Verify that the Field delimiter field is set to Comma.
      c) Click Include header row.
      d) Click Display file after job runs.
   4) Select certain fields for the text file.
      a) Click the move-all button to move all fields from the Selected list box.
      b) Double-click Cluster_ID to move this field from the Available list box to the Selected list box.
      c) Double-click ID to move this field from the Available list box to the Selected list box.
      d) Double-click SR_ID to move this field from the Available list box to the Selected list box.
      e) Double-click POSTDATE to move this field from the Available list box to the Selected list box.
      f) Double-click MANUFACTURER to move this field from the Available list box to the Selected list box.
      g) Double-click CONTACT to move this field from the Available list box to the Selected list box.
      h) Double-click CONTACT_ADDRESS to move this field from the Available list box to the Selected list box.
      i) Double-click CONTACT_CITY to move this field from the Available list box to the Selected list box.
      j) Double-click CONTACT_STATE_PROV to move this field from the Available list box to the Selected list box.
      k) Double-click CONTACT_POSTAL_CD to move this field from the Available list box to the Selected list box.
      l) Double-click CONTACT_CNTRY to move this field from the Available list box to the Selected list box.
      m) Double-click CONTACT_PHONE to move this field from the Available list box to the Selected list box.
   5) Click OK to close the Text File Output Properties window.
h. Save the job.
   1) Click the Data Flow tab.
   2) Select File → Save.
i. Run the job.
   1) Verify that the Data Flow tab is selected.
   2) Select Actions → Run Data Job. The text file appears:
      Two groups are highlighted in the above view of the text file. For Cluster_ID=4, the selected record is the record with ID=78. Why? For Cluster_ID=6, the selected record is the record with ID=11. Why?
   3) Select File → Exit to close the Notepad window.
j. Select File → Close.

6. Creating a Multi-Input/Multi-Output Data Job

a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Click the Basics Exercises repository.
e. Click New → Data Job.
   1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
   2) Type Ch4E6_MultiInputOutput in the Name field.
   3) Click OK.
f. Add the Data Source node to the Data Flow Editor.
   1) Verify that the Nodes riser bar is selected in the resource pane.
   2) Click + in front of the Data Inputs grouping of nodes.
   3) Double-click the Data Source node.
   4) Type Manufacturers Table in the Name field.
   5) Click the browse button next to the Input table field.
      a) Click + in front of the dfConglomerate Grocery data source.
      b) Click MANUFACTURERS.
c) Click OK to close the Select Table window. 6) In the Output fields area, select only the following fields: ID MANUFACTURER CONTACT CONTACT_ADDRESS CONTACT_CITY CONTACT_STATE_PROV CONTACT_POSTAL_CD CONTACT_CNTRY CONTACT_PHONE POSTDATE 7) Click OK to save the changes and close the Data Source Properties window. g. Add a Standardization node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Data Inputs grouping of nodes. in front of the Quality grouping of nodes. 3) Double-click the Standardization node. 4) Type Standardize Fields in the Name field. 5) Double-click each of the following fields to move them from the Available list box to the Selected list box. MANUFACTURER CONTACT CONTACT_ADDRESS CONTACT_STATE_PROV CONTACT_PHONE 4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 6) For the MANUFACTURER field, click 7) For the CONTACT field, click 4-197 under Definition and select Organization. under Definition and select Name. 8) For the CONTACT_ADDRESS field, click under Definition and select Address. 9) For the CONTACT_STATE_PROV field, click State/Province(Abbreviation). 10) For the CONTACT_PHONE field, click under Definition and select under Definition and select Phone. 11) Click Preserve null values. 12) Click OK to close the Standardization Properties window. h. Add a Match Codes node to the job flow. 1) In the Nodes resource pane, click 2) Click to collapse the Data Inputs grouping of nodes. in front of the Entity Resolution grouping of nodes. 3) Double-click the Match Codes node. 4) Type Generate Match Codes in the Name field. 5) Double-click the MANUFACTURER_Stnd field to move it from the Available list box to the Selected list box. 6) Click under Definition and select Organization. 7) Accept the default Sensitivity of 75. 8) Double-click the CONTACT_Stnd field to move it from the Available list box to the Selected list box. 9) Click under Definition and select Name. 10) Accept the default Sensitivity of 75. 11) Double-click the CONTACT_ADDRESS_Stnd field to move it from the Available list box to the Selected list box. 12) Click under Definition and select Address. 13) Accept the default Sensitivity of 85. 14) Double-click the CONTACT_STATE_PROV_Stnd field to move it from the Available list box to the Selected list box. 15) Click under Definition and select State/Province. 16) Accept the default Sensitivity of 85. 17) Double-click the CONTACT_POSTAL_CD field to move it from the Available list box to the Selected list box. 18) Click under Definition and select Postal Code. 4-198 Chapter 4 ACT 19) Accept the default Sensitivity of 85. 20) Click Generate null match codes for blank field values. 21) Click Preserve null values. 22) Click OK to close the Match Codes Properties window. i. Select Top To Bottom to change the layout to top to bottom. j. Add the Text File Input node to the data job. 1) Verify that the Nodes riser bar is selected in the Resource pane. 2) Click in front of the Data Inputs grouping of nodes. 3) Right-click the Text File Input node and select Insert Node. 4) Right-click the Text File Input node and select Properties. 5) Type Manufacturers Contact List in the Name field. 6) Click next to the Input file field. a) Navigate to S:\Workshop\dqdmp1\data\Text Files. b) Click Manufacturer_Contact_List.txt. c) Click Open to close the Open window. 7) Select “ (double quotation mark) as the value that the Text qualifier field is set to. 8) Verify that Comma is the value for the Field delimiter field. 9) Click Number of rows to skip. 
10) Verify that 1 (one) is the default value for the number of rows to skip.
11) Click Import under the Fields area.
    a) If necessary, navigate to S:\Workshop\dqdmp1\data\Text Files.
    b) Click Manufacturer_Contact_List.dfl.
    c) Click Open.
12) Click OK to save the changes and close the Text File Input Properties window.
k. Add a Standardization node to the job flow.
   1) Click + in front of the Quality grouping of nodes.
   2) Double-click the Standardization node. (Verify that the node was attached to the Text File Input node.)
   3) Type Standardize Fields in the Name field.
   4) Double-click each of the following fields to move them from the Available list box to the Selected list box:
      COMPANY
      NAME
      WORK_ADDRESS
      WORK_STATE
      WORK_PHONE
   5) For the COMPANY field, click the down arrow under Definition and select Organization.
   6) For the NAME field, click the down arrow under Definition and select Name.
   7) For the WORK_ADDRESS field, click the down arrow under Definition and select Address.
   8) For the WORK_STATE field, click the down arrow under Definition and select State/Province(Abbreviation).
   9) For the WORK_PHONE field, click the down arrow under Definition and select Phone.
   10) Click Preserve null values.
   11) Click OK to close the Standardization Properties window.
l. Add a Match Codes node to the job flow.
   1) Click + in front of the Entity Resolution grouping of nodes.
   2) Double-click the Match Codes node.
   3) Type Generate Match Codes in the Name field.
   4) Double-click the COMPANY_Stnd field to move it from the Available list box to the Selected list box.
   5) Click the down arrow under Definition and select Organization.
   6) Accept the default Sensitivity of 75.
   7) Double-click the NAME_Stnd field to move it from the Available list box to the Selected list box.
   8) Click the down arrow under Definition and select Name.
   9) Accept the default Sensitivity of 75.
   10) Double-click the WORK_ADDRESS_Stnd field to move it from the Available list box to the Selected list box.
   11) Click the down arrow under Definition and select Address.
   12) Accept the default Sensitivity of 85.
   13) Double-click the WORK_STATE_Stnd field to move it from the Available list box to the Selected list box.
   14) Click the down arrow under Definition and select State/Province.
   15) Accept the default Sensitivity of 85.
   16) Double-click the WORK_ZIP field to move it from the Available list box to the Selected list box.
   17) Click the down arrow under Definition and select Postal Code.
   18) Accept the default Sensitivity of 85.
   19) Click Generate null match codes for blank field values.
   20) Click Preserve null values.
   21) Click OK to close the Match Codes Properties window.
m. Add a Sequencer (Autonumber) node to the job flow.
   1) Click + in front of the Utilities grouping of nodes.
   2) Double-click the Sequencer (Autonumber) node.
   3) Type Create Unique ID in the Name field.
   4) Type ID in the Field name field.
   5) Accept the default of 1 for the Start Number and Interval fields.
   6) Click OK to close the Sequencer (Autonumber) Properties window.
n. Add the Data Joining node to the data job.
   1) Verify that the Nodes riser bar is selected in the Resource pane.
   2) Click + in front of the Data Integration grouping of nodes.
   3) Right-click the Data Joining node and select Insert Node.
      a) On the job flow diagram, drag the node below and between the two side-by-side branches.
      b) Right-click the Data Joining node in the job flow and select Node Connections.
      c) Click the add button in the Connect from area.
      d) Click the Match Codes node and then click OK.
      e) Click the add button in the Connect from area.
      f) Click the Sequencer node and then click OK.
o. Add a Branch node to the data job flow.
1) Click in front of the Utilities grouping of nodes.
2) Double-click the Branch node.
3) Type Branch in the Name field.
4) Click OK to close the Branch Properties window.
p. Add a Data Validation node to the data job flow.
1) Click in front of the Utilities grouping of nodes.
2) Double-click the Data Validation node.
3) Type Matches in the Name field.
4) Add the expression ID_1 is not null to the Expression area.
5) Click OK to close the Data Validation Properties window.
q. Add a second Data Validation node to the data job flow.
1) Click in front of the Utilities grouping of nodes.
2) Right-click the Data Validation node and select Insert Node.
3) Drag the newest Data Validation node below and to the right of the Branch node.
4) Click the Branch node and drag it to the Data Validation node. A connection is established.
5) Right-click the Data Validation node and select Properties.
6) Type Non-Matches in the Name field.
7) Add the expression ID_1 is null to the Expression area.
8) Click OK to close the Data Validation Properties window.
r. Add a Text File Output node for the matches.
1) Click the Matches data validation node.
2) Click in front of the Data Outputs grouping of nodes.
3) Double-click the Text File Output node.
4) Type Manufacturers - Matches in the Name field.
5) Specify the output file information.
a) Click next to the Output file field.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturers_Matches.txt in the File name field.
d) Click Save.
6) Specify attributes for the file.
a) Verify that the Text qualifier field is set to " (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
7) Specify the desired fields for the output file.
a) Click to remove all default selected fields from the list.
b) Specify the following fields with the specified output names:

   Field Name        Output Name
   ID_1              DB_ID
   MANUFACTURER_1    MANUFACTURER
   CONTACT_1         CONTACT
   ID_2              TEXT_ID
   COMPANY_2         COMPANY
   NAME_2            NAME

8) Export the field layout.
a) Click Export.
b) If necessary, navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturer_Matches.dfl in the File name field.
d) Click Save.
9) Click OK to save the changes and close the Text File Output Properties window.
s. Add a Text File Output node for the non-matches.
1) Click the Non-Matches data validation node.
2) Click in front of the Data Outputs grouping of nodes.
3) Double-click the Text File Output node.
4) Type Manufacturers - Non-Matches in the Name field.
5) Specify the output file information.
a) Click next to the Output file field.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturers_NonMatches.txt in the File name field.
d) Click Save.
6) Specify attributes for the file.
a) Verify that the Text qualifier field is set to " (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
7) Specify the desired fields for the output file.
a) Click to remove all default selected fields from the list.
b) Specify the following fields with the specified output names:

   Field Name        Output Name
   COMPANY_2         COMPANY
   NAME_2            NAME
   WORK_ADDRESS_2    WORK_ADDRESS
   WORK_CITY_2       WORK_CITY
   WORK_STATE_2      WORK_STATE
   WORK_ZIP_2        WORK_ZIP
   WORK_PHONE_2      WORK_PHONE

8) Export the field layout.
a) Click Export.
b) If necessary, navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturer_NonMatches.dfl in the File name field.
d) Click Save.
9) Click OK to save the changes and close the Text File Output Properties window.
The final job flow should resemble the following:
t. Save the job.
1) Click the Data Flow tab.
2) Select File → Save.
u. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions → Run Data Job. The text files appear.
v. Select File → Exit to close each of the Notepad windows.

Chapter 5 MONITOR

5.1 Business Rules ....................................................................................... 5-3
    Demonstration: Creating a Row-Based Business Rule ........................... 5-6
    Exercises ............................................................................................ 5-11
    Demonstration: Adding a Business Rule and an Alert to a Profile ......... 5-15
    Demonstration: Performing a Historical Visualization .......................... 5-21
    Exercises ............................................................................................ 5-25
    Demonstration: Creating a Monitoring Job for a Row-Based Rule ......... 5-29
    Demonstration: Creating Another Data Job Calling the Same Task ....... 5-38
    Exercises ............................................................................................ 5-41
    Demonstration: Creating a Monitoring Job for a Set-Based Rule .......... 5-42
    Exercises ............................................................................................ 5-48
    Demonstration: Viewing the Monitor Dashboard .................................. 5-50
5.1 Business Rules

Objectives
• Define a business rule.
• Create a row-based business rule.

What Is a Business Rule?
A business rule is a formula, validation, or comparison that can be applied to a given set of data. Data must either pass or fail the business rule.

Business Rule Examples
• Validate whether product codes adhere to a particular pattern.
• Test domain values against a list of acceptable values.
  – Gender is Male, Female, or Unknown.
  – Region is East, Central, or West.
• Detect unexpected variances.
  – The Product Description field contains blanks or nulls.
  – More than 5% of mailing addresses are not verifiable.
  – The total sales amount for Q1 is less than the expected target amount.

Business rules can be used in both data profiles and data jobs to monitor your data.

Creating a Business Rule
To create a business rule, do the following:
1) Create Fields. The field names are placeholders. When the business rule is applied to the data, the input data field names are mapped to these placeholder names.
2) Create Rule. This defines the criteria that the input data is checked against.

Types of Rules
• Row-based rule – evaluates every row of data passed into the monitoring node.
• Set-based rule – evaluates and applies rules to all of the input data in totality (for example, evaluates 1,000 rows as a single set).
• Group-based rule – evaluates and applies all rules to groups of data (for example, if data is grouped by product code, the rules are evaluated once for each product code).
A short sketch contrasting the three evaluation styles appears after the notes below.

When building a business rule, you can do the following:
• Take advantage of the Expression Builder, a wizard-like interface that helps you build business rules from a predefined list of data checks.
• Specify fields that are not being evaluated and enable the passing of data through the monitoring node so that those fields are available in the tasks and can be output into audit files.
• Specify the expression (that is, the DataFlux syntax) of the business rules that were built with the Expression Builder.
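To make the three evaluation styles concrete, here is a small Python sketch of how each rule type consumes its input. The rule bodies and the sample rows are invented for illustration and are not DataFlux syntax:

    from itertools import groupby
    from operator import itemgetter

    rows = [
        {"PRODUCT": "A", "UPC": "00012345678905"},
        {"PRODUCT": "A", "UPC": "bad-code"},
        {"PRODUCT": "B", "UPC": "00099999999990"},
    ]

    # Row-based: the check fires once per row.
    row_violations = [r for r in rows if not r["UPC"].isdigit()]

    # Set-based: the check fires once for the whole input.
    set_violation = len(rows) < 1000          # e.g., "expected at least 1,000 rows"

    # Group-based: the check fires once per group key.
    keyfunc = itemgetter("PRODUCT")
    group_violations = [
        key
        for key, grp in groupby(sorted(rows, key=keyfunc), key=keyfunc)
        if sum(1 for _ in grp) < 5            # e.g., "fewer than 5 rows per product"
    ]

    print(len(row_violations), set_violation, group_violations)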
Creating a Row-Based Business Rule

In this demonstration, you create a row-based business rule to monitor when an e-mail address and a phone number are both missing from a record.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools → Business Rules Manager → Basics Demos. A tab opens and the Business Rules Manager appears for the Basics Demos repository.
4. Create the three needed fields.
a. Right-click the Fields folder and select New Field.
1) Type PK in the Name field.
2) Type Primary Key in the Description field.
3) Click OK.
b. Right-click the Fields folder and select New Field.
1) Type EMAIL in the Name field.
2) Type Email Address in the Description field.
3) Click OK.
c. Right-click the Fields folder and select New Field.
1) Type PHONE in the Name field.
2) Type Phone Number in the Description field.
3) Click OK.
The final set of fields should resemble the following:
5. Create a new row-based rule.
a. Right-click the Rules folder and select New Rule.
b. Type Email and Phone Missing in the Name field.
c. Verify that Row is selected in the Type area.
d. Under the Definition area, click Builder on the Checks tab. The Expression Builder window appears.
e. Click Get information about the contents of a field (is null, is numeric, etc.) under Get Field Information in the Step 1 area.
f. Click to move the check to the Step 2 area.
g. Click the Field is null link. The Get Field Information window appears.
1) Click next to the Field field. The Select Field window appears.
2) Click the EMAIL field.
3) Click OK to close the Select Field window.
h. Click in the Check field and select is null.
i. Click OK to close the Get Field Information window.
j. Click Get information about the contents of a field (is null, is numeric, etc.) under Get Field Information in the Step 1 area.
k. Click to move the check to the Step 2 area.
l. Click the Field is null link. The Get Field Information window appears.
1) Click next to the Field field. The Select Field window appears.
2) Click the PHONE field.
3) Click OK to close the Select Field window.
m. Click in the Check field and select is null.
n. Click OK to close the Get Field Information window.
o. Click OK to close the Expression Builder window.
p. Click the Fields tab.
q. Double-click the PK field to move it from the Available list to the Selected list.
r. Click OK to close the New Rule window.
The new rule appears in the Rules folder.
6. Select File → Close to close the Business Rules Manager tab.
7. If necessary, click the Home tab.
8. If necessary, click the Folders riser bar.
9. Click the Basics Demos repository.
10. Select View → Refresh. A Shared Data folder appears.
11. Expand the Shared Data folder. The new rule appears in the Shared Data folder.

Exercises

1. Creating a Row-Based Rule
Create a row-based rule that performs two checks on a field: Is it numeric? Is its length equal to 14? Some specifics follow. (A plain-Python restatement of the combined check appears after this list.)
• Use the Basics Exercises repository.
• Create two fields – PK (Primary Key) and UPC (Universal Product Code).
• Use the check Get information about the contents of a field to specify the check UPC is not numeric.
• Use the check Compare the length of a field to a value to specify the check Length of UPC is not equal to 14.
• Either check alone can trigger the rule (add an OR between the two checks).
• Select the PK field in addition to the UPC field.
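For reference, the combined check that this exercise builds in the Expression Builder can be restated in plain Python (illustrative only; the sample values are hypothetical):

    def upc_violates(upc):
        """True when the UPC fails the rule: not numeric OR length not 14."""
        return (not upc.isdigit()) or (len(upc) != 14)

    assert upc_violates("ABC123")               # not numeric
    assert upc_violates("123")                  # numeric but wrong length
    assert not upc_violates("00012345678905")   # 14 digits passes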
5.2 Data Profiling

Objectives
• Discuss how to define and explore business rules in data profiles.
• Discuss how to define and explore alerts in data profiles.
• Discuss how to define historical visualizations in a data profile report.

Using a Business Rule in a Data Profile
On the Properties tab for a profile, select the Business Rules sub-tab to apply one or more business rules to the data.

Surfacing a Business Rule in Profile Reports
After the profile is run, on the Report tab, select the Business Rules sub-tab to view the number of violations.

Alerts
On the Properties tab for a profile, select the Alerts sub-tab to set up one or more alerts for the data. After the profile is run, on the Report tab, select the Alerts sub-tab to determine whether any alerts were triggered.

Historical Visualizations
Historical visualizations are customized charts that show how metrics change over a period of time.

Adding a Business Rule and an Alert to a Profile

In this demonstration, you examine the steps necessary to add a business rule that was created previously. In addition, you set up an alert and then locate and view the triggered alerts in the profile report.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Expand the Basics Demos repository.
5. Expand the profiles_and_explorations folder.
6. Double-click the Ch3D3_dfConglomerate_Profile profile.
7. Click the Properties tab.
8. Expand dfConglomerate Gifts and click Customers.
9. Click the Business Rules sub-tab.
10. Click Add. The Business Rule Properties window appears.
a. Click in the Business rule field and select Email and Phone Missing.
b. Click in the Field Name field for PHONE and select BUSINESS PHONE.
c. Click in the Field Name field for PK and select ID.
d. Click the Log data that violates the business rule check box.
e. Click OK. The Business Rules sub-tab displays the following:
11. Click Products (in dfConglomerate Gifts).
12. Click the Alerts sub-tab.
13. Click Add. The Add Alert window appears.
a. Click Standard Metric.
b. Click Continue. The Add Standard Metric Alert window appears.
c. Click in the Field name field and select PRODUCT CODE.
d. Click in the Metric field and select Pattern Count.
e. Click in the Comparison field under Alert condition and select Metric is greater than.
f. Type 3 (three) in the Value field.
g. Type Product Code Pattern Count is greater than 3 in the Description field. (The Send e-mail option requires that the EMAIL SMTP Server is configured in the app.cfg file.)
h. Click OK. The Alerts sub-tab displays the following:
14. Select File → Save Profile to save the changed profile.
15. Select Actions → Run Profile Report.
a. Type Additional profile run in the Description field.
b. Verify that Append to existing report is selected.
c. Click OK to close the Run Profile window. The profile executes, a status of the execution is displayed, and the Report tab becomes active.
16. Investigate the triggered alert.
a. Click in front of dfConglomerate Gifts. Expand Products → PRODUCT CODE. Verify that a warning symbol appears with each selection.
b. Click the Products table.
c. Click the Alerts sub-tab. The alert was triggered, and it is written on the sub-tab. If the alert had not been triggered, there would be no indicators on the Report tab and nothing written on the Alerts sub-tab for the table.
QUESTION: What is the Pattern Count metric for the PRODUCT CODE field?
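Pattern Count is the number of distinct character patterns that occur in a field. The sketch below assumes the common profiling convention of mapping uppercase letters to A, lowercase letters to a, and digits to 9 while passing other characters through; DataFlux's exact mapping may differ, and the product codes shown are invented. An alert on Pattern Count greater than 3 would trigger for this sample:

    def pattern(value):
        """Map each character to a pattern class: A (uppercase),
        a (lowercase), 9 (digit); other characters pass through."""
        out = []
        for ch in value:
            if ch.isupper():
                out.append("A")
            elif ch.islower():
                out.append("a")
            elif ch.isdigit():
                out.append("9")
            else:
                out.append(ch)
        return "".join(out)

    codes = ["GF-1001", "GF-1002", "gf-1003", "1004-GF", "GF 1005"]
    patterns = {pattern(c) for c in codes}
    print(len(patterns), sorted(patterns))   # Pattern Count would be 4 here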
17. Investigate the applied business rule.
a. Click the Customers table.
b. Click the Business Rules sub-tab. The actual records where the violations occurred are displayed.
c. Double-click the summary of the violations. The Business Rule Violation Drill Through window appears. This drill-through information is available only because Log data that violates the business rule was selected.
d. Click Close to close the Business Rule Violation Drill Through window.
e. Click the Employees table.
f. If necessary, click the Business Rules sub-tab. There should not be any violations for this table. The business rule was established only for the Customers table.
18. Select File → Close Profile.

Performing a Historical Visualization

In this demonstration, you run a data job that updates the Customers table with standardized phone fields. A profile of the Customers table is rerun, and the historical metrics are examined visually.

1. Access a data job in the Basics Solutions repository.
a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Expand the Basics Solutions repository.
e. Expand the batch_jobs folder.
f. Double-click the Ch5D3_Update_Customers_Table data job.
2. Review the data job.
a. Right-click the Data Source node and select Properties.
1) Verify that the ID field and the four phone number fields are selected from the Customers table.
2) Click OK to close the Data Source Properties window.
b. Right-click the Standardization node and select Properties.
1) Verify that the Phone standardization definition is applied to each of the four phone number fields.
2) Click OK to close the Standardization Properties window.
c. Right-click the Data Target (Update) node and select Properties.
1) Verify that the ID Input Field is mapped to the ID Table Field.
2) Verify that the standardized phone fields (those ending with the default _Stnd) replace their corresponding phone fields.
3) Click OK to close the Data Target (Update) Properties window.
3. Run and then close the data job.
a. Verify that the Data Job tab is selected.
b. Select Actions → Run Data Job.
c. Verify that the job runs with no errors.
d. Select File → Close.
4. Access and run a profile that involves the Customers table.
a. Verify that the Home tab is selected.
b. Click the Folders riser bar.
c. Expand the Basics Demos repository.
d. Expand the profiles_and_explorations folder.
e. Double-click the Ch3D3_dfConglomerate_Profile profile.
f. Select Actions → Run Profile Report.
g. Type Additional processing after phone standardizations in the Description field.
h. Verify that Append to existing report is selected.
i. Click OK to close the Run Profile window. The profile executes, a status of the execution is displayed, and the Report tab becomes active.
5. Review the generated information for the Customers table.
a. Verify that the Report tab is active.
b. Click in front of dfConglomerate Gifts.
c. Click the Customers table.
d. Click the Visualizations sub-tab.
e. Add a new graph.
1) Click the toolbar tool for adding a new chart. The Chart Properties window appears.
2) Type Historical Line for Phone Metrics in the Description field.
3) Click in the Chart type field and select Historical Line.
4) Click BUSINESS PHONE under the Fields list.
5) Click Unique Count, Null Count, Blank Count, and Pattern Count under the Metrics list.
6) Select the first date-time value as the value for the Start date field.
7) Select the last date-time value as the value for the End date field.
8) Click OK to close the Chart Properties window.
The Visualizations sub-tab surfaces the newly specified historical line plot. Your view might differ depending on the number of times that the profile was run. The Pattern Count and Unique Count metrics decreased for this last execution.
6. Select File → Close Profile.
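Behind a historical line chart there is nothing more than one point per profile run for each selected metric. A minimal matplotlib sketch with invented run dates and values (not the course data) shows the idea; standardizing the phone fields is what drives the Pattern Count drop in the last run:

    import matplotlib.pyplot as plt

    # Hypothetical per-run metrics for BUSINESS PHONE; in the tool these
    # come from the appended profile report runs.
    runs = ["01May2012", "05May2012", "09May2012"]
    pattern_count = [6, 6, 2]     # standardization collapses phone patterns
    unique_count = [948, 948, 931]

    fig, ax = plt.subplots()
    ax.plot(runs, pattern_count, marker="o", label="Pattern Count")
    ax.plot(runs, unique_count, marker="o", label="Unique Count")
    ax.set_xlabel("Profile run")
    ax.set_ylabel("Metric value")
    ax.legend()
    plt.show()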
Exercises

2. Adding a Business Rule and an Alert to a Profile
Access the profile named Ch3E3_dfConglomerateGrocery_Profile (in the Basics Exercises repository). Update the profile as follows:
• Add the UPC Check business rule for the BREAKFAST_ITEMS table. Map PK to ID, and be sure to log the data that violates the business rule.
• On the UPC Check business rule, add an alert that looks for a percentage of failed rows greater than 0 (zero).
• Save and rerun the profile.
QUESTION: Were there violations to the business rule? If so, what records did not pass the rule check?
Answer:
QUESTION: Was the alert triggered?
Answer:

5.3 Data Monitoring Jobs

Objectives
• Discuss how to define a monitoring task.
• Discuss and use several of the available events.
• Create a monitoring data job.
• View the Monitoring report.

Monitoring Tasks
Monitoring tasks are created by pairing a defined business rule with one or more events. Some available events include the following:
• Call a realtime service
• Execute a program
• Log error to repository
• Log error to text file
• Run a local job
• Run a local profile
• Send e-mail message
• Set a data flow key or value
• Write a row to a table
When creating a monitoring task, select the business rule(s) that are part of the task. For each business rule, select the event(s) that is triggered if there is a rule violation. If the Send e-mail message event is selected, a separate e-mail is sent for each rule violation.

Business Rules in Data Jobs
In the Data Monitoring node, map the fields used in the task to the input fields in the data job. The Export function from the Business Rules Manager suggests a mapping based on field names. If a field is not mapped or is mapped incorrectly, select the menu to map the task field to the appropriate field name. A monitoring task can be applied to one or more data sources, including modified data in a data job. After the job is run, the log lists the number of rows read and the number of events triggered by the Data Monitoring step. Previewing the Data Monitoring node does not cause an event to be triggered. An event is triggered only when the job is run.

Monitoring Report
The Monitoring report provides a summary listing of monitoring task executions for a selected repository. Details of violations are listed only if the Log error to repository event was selected for the rule in the monitoring task. The listing of items can be sorted or grouped to enable easier viewing of the results of the monitoring task executions. The data can also be exported as a CSV or XML file.

Creating a Monitoring Job for a Row-Based Rule

In this demonstration, you create a task that combines a previously created rule with two events. The task is exported as a data job, the data job is executed, and the results are reviewed.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools → Business Rules Manager → Basics Demos. A tab opens and the Business Rules Manager appears for the Basics Demos repository.
4. Right-click the Tasks folder and select New Task.
a. Type Email and Phone Missing in the Name field.
b. Double-click the Email and Phone Missing rule to move it from the Available list to the Selected list.
c. With the rule highlighted in the Selected list, click Rule Details. The Rule Details window appears.
1) Click Add in the Events window.
a) Click in the Select event field and select Log error to repository.
b) Click Continue.
c) Click to move the three fields (EMAIL, PHONE, PK) from Available fields to Selected fields.
d) Click OK to close the Log Error To Repository Event Properties window.
2) Click Add in the Rule Details window.
a) Click in the Select event field and select Log error to text file.
b) Click Continue.
c) Click next to the File field.
(1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
(2) Type Ch5D4_EmailAndPhoneMissing.txt in the File name field.
(3) Click Save.
d) Type PK where Email and Phone are Both Missing in the Header field.
e) Type %%PK%% in the Message field.
f) Type --End of File-- in the Footer field.
The final settings for the Log Error To Text File Event Properties window should resemble the following:
g) Click OK to close the Log Error To Text File Event Properties window.
3) Click OK to close the Rule Details window.
d. Click OK to close the New Task window.
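The Log error to text file event writes the header once, then one message line per violating row with each %%FIELD%% placeholder replaced by that row's value, and finally the footer. A hedged Python imitation of that behavior, using invented violating rows (the placeholder convention follows the %%PK%% setting above):

    import re

    def log_errors_to_text_file(violations, path, header, message, footer):
        """Imitate the Log error to text file event: header, one
        substituted message line per violating row, then footer."""
        with open(path, "w") as out:
            out.write(header + "\n")
            for row in violations:
                # Replace each %%FIELD%% token with that row's value.
                line = re.sub(r"%%(\w+)%%",
                              lambda m: str(row.get(m.group(1), "")),
                              message)
                out.write(line + "\n")
            out.write(footer + "\n")

    violations = [{"PK": 118}, {"PK": 204}]   # hypothetical violating rows
    log_errors_to_text_file(violations,
                            "Ch5D4_EmailAndPhoneMissing.txt",
                            "PK where Email and Phone are Both Missing",
                            "%%PK%%",
                            "--End of File--")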
5. Right-click the Sources folder and select New Source.
a. Type Customers in the Name field.
b. Type Customers Table in the Description field.
c. Click OK to close the New Source window.
6. Right-click the Sources folder and select New Source.
a. Type Employees in the Name field.
b. Type Employees Table in the Description field.
c. Click OK to close the New Source window.
The two sources should appear as follows:
7. If necessary, expand the Tasks folder.
8. Right-click the Email and Phone Missing task and select Export To Data Job.
a. Verify that Data Source is selected.
b. Click Next.
c. Click next to the Input table field.
1) Expand the dfConglomerate Gifts data source.
2) Click the Customers table.
3) Click OK to close the Select Table window.
d. Move all fields from the Selected list to the Available list.
e. Double-click ID to move it from the Available list to the Selected list.
f. Double-click EMAIL to move it from the Available list to the Selected list.
g. Double-click BUSINESS PHONE to move it from the Available list to the Selected list.
h. Click Next.
i. Click in the Data Source Field field for the PHONE task field and select BUSINESS PHONE.
j. Click in the Data Source Field field for the PK task field and select ID.
k. Click Next.
l. Type Both Email And Phone Missing in the Description field.
m. Click next to the Job name field.
1) Type Ch5D4_BothEmailAndPhoneMissing in the Name field.
2) If necessary, select Basics Demos as the value for the Save in field.
3) Double-click the batch_jobs folder.
4) Click Save.
n. Click in the Source field and select Customers.
o. Click Finish.
If the folders for Basics Demos do not appear when you save the data job, do the following:
• Click the Home tab.
• Click the Administration riser bar.
• Click to expand Repository Definitions.
• Right-click Basics Demos and select Update lineage.
• Click Close to close the Updating Repository Lineage window.
9. Access the new data job.
a. Click the Home tab.
b. Click the Folders riser bar.
c. Expand Basics Demos → batch_jobs.
d. Double-click the Ch5D4_BothEmailAndPhoneMissing data job.
e. If necessary, adjust the view to see the entire data job flow. The data job should resemble the following:
10. Specify updated attributes for each of the nodes.
a. If necessary, click the Data Source node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Customers Table as the value for the Name field.
e. Type Data Source as the value for the Description field.
f. Click the Data Monitoring node in the data flow diagram.
g. Type Email/Phone Missing as the value for the Name field.
h. Type Data Monitoring as the value for the Description field.
The data flow should resemble the following:
11. Select File → Save.
12. Select Actions → Run Data Job.
13. When the job completes, click the Log tab. A message was written to the log for each violation that was encountered.
14. Access Windows Explorer and examine the text file that was created.
a. Select Start → All Programs → Accessories → Windows Explorer to access Windows Explorer.
b. Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
c. Double-click Ch5D4_EmailAndPhoneMissing.txt. A Notepad window appears with the logged rule violations.
d. Select File → Exit to close the Notepad window.
15. From the data job, select Tools → Monitoring Report → Basics Demos.
16. Click the Trigger Values tab. The specific violations are listed only because you selected the Log error to repository event. The summary information is written regardless, but the specifics are listed only if there are violations and this event is selected.
17. Click the History Graph tab.
18. Click the bar chart tool.
19. Select File → Close to close the tab for the Monitoring Report.
20. Select File → Close to close the tab for the data job.
21. Select File → Close to close the Business Rules Manager tab.

Creating Another Data Job Calling the Same Task

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click to expand the Basics Demos repository.
5. Click to expand the batch_jobs folder.
6. Right-click the Ch5D4_BothEmailAndPhoneMissing data job and select Copy to Folder.
a. If necessary, navigate to Basics Demos and select the batch_jobs folder in the Copy To Folder window.
b. Click Open.
c. Click Rename the new item.
d. Click OK.
7. Right-click the new data job (Copy of Ch5D4_BothEmailAndPhoneMissing) and select Rename.
a. Type Ch5D5_BothEmailAndPhoneMissing for the new name.
b. Press ENTER.
8. Double-click the new data job (Ch5D5_BothEmailAndPhoneMissing). The data job opens on a new tab.
9. Change the data source of the data job.
a. Double-click the Data Source node (or right-click it and select Properties).
b. Type Employees Table as the value for the Name field.
c. Click next to the Input table field.
1) If necessary, expand the dfConglomerate Gifts data source.
2) Click the Employees table.
3) Click OK to close the Select Table window.
d. Click to remove all selected fields.
e. Double-click ID to move it from the Available list to the Selected list.
f. Double-click EMAIL to move it from the Available list to the Selected list.
g. Double-click BUSINESS PHONE to move it from the Available list to the Selected list.
h. Click OK to close the Data Source Properties window.
10. Edit the Data Monitoring node.
a. Double-click the Data Monitoring node (or right-click it and select Properties).
b. Click in the Source field and select Employees.
c. Click OK to close the Data Monitoring Properties window.
11. Select File → Save.
12. Select Actions → Run Data Job.
13. When the job completes, click the Log tab. There were no violations this time. Therefore, the text file is not created.
14. From the data job, select Tools → Monitoring Report → Basics Demos.
15. Verify that no triggers occurred.
16. Select File → Close to close the tab for the Monitor Viewer.
17. Select File → Close to close the tab for the data job.

Exercises

3. Creating a Monitoring Job for a Row-Based Rule
• Create a task in the Basics Exercises repository.
  – Use the UPC Check rule.
  – Add an event of Log error to repository.
• Create a source in the Basics Exercises repository (use a name of Breakfast Items with a description of Breakfast Items Table).
• Export the new task as a data job with the following specifics:
  – Use the BREAKFAST_ITEMS table in the dfConglomerate Grocery data connection.
  – Pass just the ID and UPC fields to the Data Monitoring node.
  – Map the ID data source field to the PK task field.
  – Specify a data job name of Ch5E3_UPCCheck with a description of Check UPC Field.
• Run the job, and review the log.
QUESTION: How many records were monitored?
Answer:
QUESTION: How many records triggered the monitoring event?
Answer:
• View the records (if there are any) that triggered the monitoring event.

Creating a Monitoring Job for a Set-Based Rule

In this demonstration, you create a new field for use in a new set-based rule. The rule is combined with an event to form a task, and the new task is used in a data job.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools → Business Rules Manager → Basics Demos.
4. Create a new field by right-clicking the Fields folder and selecting New Field.
a. Type COUNTRY in the Name field.
b. Type Country Value in the Description field.
c. Click OK.
5. Create a new set-based rule.
a. Right-click the Rules folder and select New Rule.
b. Type Country Pattern Count Check in the Name field.
c. Click Set in the Type area.
d. Under Definition, click Builder on the Checks tab. The Expression Builder window appears.
e. Click Compare a field to a value or to another field under Compare Entire Field in the Step 1 area.
f. Click to move the check to the Step 2 area.
g. Click the Field is equal to Value link. The Compare To Value Or Field window appears.
1) Click next to the Field field. The Select Field window appears.
2) Click to expand the COUNTRY field.
3) Click to expand the Standard folder.
4) Click PATCOUNT. (SM represents Standard Metric, and HSM represents Historical Standard Metric.)
5) Click OK to close the Select Field window.
h. Click in the Check field and select greater than.
i. Type 1 (one) in the Value field.
j. Click OK to close the Compare To Value Or Field window.
k. Click OK to close the Expression Builder window.
l. Click OK to close the New Rule window.
6. Right-click the Tasks folder and select New Task.
a. Type Country Pattern Count Check in the Name field.
b. Double-click the Country Pattern Count Check rule to move it from the Available list to the Selected list.
c. With the rule highlighted in the Selected list, click Rule Details. The Events window appears.
1) Click Add in the Events window.
a) Click in the Select event field and select Log error to repository.
b) Click Continue.
c) Double-click SM$COUNTRY$BLANKCOUNT to move it from Available fields to Selected fields.
d) Double-click SM$COUNTRY$NULLCOUNT to move it from Available fields to Selected fields.
e) Double-click SM$COUNTRY$PATCOUNT to move it from Available fields to Selected fields.
f) Double-click SM$COUNTRY$PATMODE to move it from Available fields to Selected fields.
g) Click OK to close the Log Error To Repository Event Properties window.
2) Click OK to close the Events window.
d. Click OK to close the New Task window.
7. Select File → Close to close the Business Rules Manager tab.
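Because this is a set rule, it produces a single pass/fail verdict for the entire input rather than one per row: the PATCOUNT standard metric (the number of distinct character patterns, as sketched earlier in this chapter) is computed over the whole COUNTRY column and compared to 1. A rough Python restatement, reusing the hypothetical pattern() helper from the earlier sketch:

    def country_patcount_check(countries):
        """Set-based rule: one verdict for the whole column. The rule
        is violated when the distinct pattern count exceeds 1."""
        patcount = len({pattern(c) for c in countries if c})
        return patcount > 1

    print(country_patcount_check(["USA", "usa", "U.S.A."]))  # True: 3 patterns
    print(country_patcount_check(["USA", "CAN", "MEX"]))     # False: 1 pattern

After the Standardization node rewrites every value to an ISO three-character code, the column should show a single pattern and the rule should no longer trigger, which is the point of the job built next.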
8. Create a data job to call the task.
a. If necessary, click the Home tab.
b. Verify that the Folders riser bar is selected.
c. Expand Basics Demos → batch_jobs.
d. Click Data Job.
1) Type Ch5D6_StandardizeCountry_CheckPattern in the Name field.
2) Click OK.
e. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the resource pane.
2) Click in front of the Data Inputs grouping of nodes.
3) Double-click the Data Source node.
f. Specify properties for the Data Source node.
1) Type Customers Table in the Name field.
2) Click next to the Input table field.
a) Click in front of the dfConglomerate Gifts data source.
b) Click Customers.
c) Click OK to close the Select Table window.
3) Click to remove all selected fields.
4) Double-click COUNTRY/REGION to move it from the Available list to the Selected list.
5) Click OK to save the changes and close the Data Source Properties window.
g. Add a Standardization node to the job flow.
1) In the Nodes resource pane, click to collapse the Data Inputs grouping of nodes.
2) Click in front of the Quality grouping of nodes.
3) Double-click the Standardization node.
h. Specify properties for the Standardization node.
1) Type Standardize Country in the Name field.
2) Double-click COUNTRY/REGION to move it from the Available list to the Selected list.
3) For the COUNTRY/REGION field, click under Definition and select Country (ISO 3 Char).
4) Click Preserve null values.
5) Click OK to save the changes and close the Standardization Properties window.
i. Add a Data Monitoring node to the job flow.
1) In the Nodes resource pane, click to collapse the Quality grouping of nodes.
2) Click in front of the Monitor grouping of nodes.
3) Double-click the Data Monitoring node.
j. Specify properties for the Data Monitoring node.
1) Type Check Country Pattern in the Name field.
2) Click for the Task name field and select Country Pattern Count Check.
3) Type Check Country Pattern in the Description field.
4) Click for the Source field and select Customers.
5) Click under the Field Name field and select COUNTRY/REGION_Stnd (for the task field COUNTRY).
6) Click OK to save the changes and close the Data Monitoring Properties window.
9. Select File → Save Data Job.
10. Select Actions → Run Data Job.
11. When the job completes, click the Log tab. A message was written to the log for each violation that was encountered. Because this is a set rule, the number of aggregate rows read can be only 1, and the number of triggered events can be only 0 or 1.
12. Select Tools → Monitoring Report → Basics Demos.
13. From the Details pane, click the Trigger Values tab.
14. Select File → Close to close the tab for the Monitor Viewer.
15. Select File → Close to close the tab for the data job.

Exercises

4. Creating a Monitoring Job for a Group-Based Rule (A sketch of the group-based evaluation appears after this exercise.)
• Create a new field BRAND (Brand Name) in the Basics Exercises repository.
• Create a group-based rule named Low Brand Count that groups on distinct values of BRAND. The rule should check for the count of the values of BRAND being less than 5 (five).
• Create a task (named Low Brand Count) that calls the Low Brand Count rule with a Log error to repository event. Log the fields BRAND and SM$BRAND$COUNT.
• Export the new task as a data job named Ch5E4_LowBrandCount, saving it by selecting Basics Exercises → batch_jobs.
• Run the new job.
QUESTION: How many aggregate records were monitored?
Answer:
QUESTION: How many records triggered the monitoring event?
Answer:
• View the records (if there are any) that triggered the monitoring event.
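A group-based rule like Low Brand Count evaluates once per distinct value of the group-by field rather than once per row or once per set. A rough Python restatement with invented brand data:

    from collections import Counter

    def low_brand_count_violations(brands, threshold=5):
        """Group-based rule: group rows by BRAND and flag every
        brand whose row count is less than the threshold."""
        counts = Counter(b for b in brands if b)
        return {brand: n for brand, n in counts.items() if n < threshold}

    # Hypothetical rows; each entry is one row's BRAND value.
    brands = ["Oaties"] * 12 + ["SunPop"] * 3 + ["KrispKo"] * 1
    print(low_brand_count_violations(brands))   # {'SunPop': 3, 'KrispKo': 1}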
Monitoring Dashboards
The Monitor dashboard displays alert trigger information for a selected repository on a separate tab. Each tab contains the following elements:
• Triggers Per Date (Last X Days)
• Triggers Per Records Processed
• Triggers Per Source (Last X Days)
• Triggers Per User (Last X Days)
The Triggers per Date graph shows a percentage of triggers per records processed. The graph scale is from 0 to 100, representing percent. For example, if you have 100 triggers in 5,000 rows, the triggers represent 2.0% of the records processed. If you move your cursor over a data point, a tooltip displays helpful information such as "... 11 triggers in 3276 rows." The number of days for the Last X Days elements is configured as part of the Monitoring options by selecting Tools → Data Management Studio Options from the menu.

Viewing the Monitor Dashboard

In this demonstration, you examine the accumulated information in the Basics Demos repository via the Monitor dashboard.

1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2. DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Information riser bar.
4. Click Monitor in the Information pane.
5. Click the tab labeled Basics Demos - Dashboard.

5.4 Solutions to Exercises

1. Creating a Row-Based Rule
a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Select Tools → Business Rules Manager → Basics Exercises. A tab opens and the Business Rules Manager appears for the Basics Exercises repository.
c. Create the two needed fields.
1) Right-click the Fields folder and select New Field.
a) Type PK in the Name field.
b) Type Primary Key in the Description field.
c) Click OK.
2) Right-click the Fields folder and select New Field.
a) Type UPC in the Name field.
b) Type Universal Product Code in the Description field.
c) Click OK.
d. Create a new row-based rule.
1) Right-click the Rules folder and select New Rule.
2) Type UPC Check in the Name field.
3) Verify that Row is selected in the Type area.
4) Under Definition, click Builder on the Checks tab.
a) Click Get information about the contents of a field (is null, is numeric, etc.) under Get Field Information in the Step 1 area.
b) Click to move the check to the Step 2 area.
c) Click the Field is null link.
(1) Click next to the Field field. The Select Field window appears.
(i) Click the UPC field.
(ii) Click OK to close the Select Field window.
(2) Click in the Check field and select is not numeric.
(3) Click OK to close the Get Field Information window.
d) Click OR.
e) Click Compare the length of a field to a value in the Step 1 area.
f) Click to move the check to the Step 2 area.
g) Click the Length of Field is equal to Value link.
(1) Click next to the Field field. The Select Field window appears.
(i) Click the UPC field.
(ii) Click OK to close the Select Field window.
(2) Click in the Operator field and select not equal to.
(3) Type 14 in the Length field.
(4) Click OK to close the Compare Length To Value window.
h) Click OK to close the Expression Builder window.
The Definition area of the Checks tab should resemble the following:
5) Click the Fields tab.
6) Double-click the PK field to move it from the Available list to the Selected list.
7) Click OK to close the New Rule window.
e. Select File → Close to close the Business Rules Manager tab.

2. Adding a Business Rule and an Alert to a Profile
a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Expand the Basics Exercises repository.
e. Expand the profiles_and_explorations folder.
f. Double-click the Ch3E3_dfConglomerateGrocery_Profile profile.
g. Click the Properties tab.
h. Expand dfConglomerate Grocery and click BREAKFAST_ITEMS.
i. Click the Business Rules sub-tab.
j. Click Add.
1) Click in the Business rule field and select UPC Check.
2) Click in the Field Name field for PK and select ID.
3) Click Log data that violates the business rule.
4) Click OK.
k. Click the Alerts sub-tab.
l. Click Add.
1) Click Business Rule.
2) Click Continue.
3) Click in the Business rule field and select UPC Check.
4) Click in the Comparison field under Alert condition and select Percentage of failed rows is greater than.
5) Type 0 (zero) in the Value field.
6) Type UPC field - Percentage of failed rows is greater than 0 in the Description field.
7) Click OK.
m. Select File → Save Profile to save the changed profile.
n. Select Actions → Run Profile Report.
1) Type Third profile in the Description field.
2) Verify that Append to existing report is selected.
3) Click OK to close the Run Profile window. The profile executes, a status of the execution is displayed, and the Report tab becomes active.
o. Review the profile report.
1) Click in front of dfConglomerate Grocery.
2) Click the BREAKFAST_ITEMS table.
3) Click the Business Rules sub-tab.
QUESTION: Were there violations to the business rule?
Answer: Yes
QUESTION: If so, what records did not pass the rule check?
Answer: Double-click the UPC Check item on the Business Rules sub-tab to see the violating records.
4) Click the Alerts sub-tab. The icon indicating that the alert was triggered does not appear on the UPC field. The alert is on a business rule, not on a specific field as is the case when you set a standard metric alert.
QUESTION: Was the alert triggered?
Answer: Yes
p. Select File → Close Profile.

3. Creating a Monitoring Job for a Row-Based Rule
a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Select Tools → Business Rules Manager → Basics Exercises.
d. Right-click the Tasks folder and select New Task.
1) Type UPC Check in the Name field.
2) Double-click the UPC Check rule to move it from the Available list to the Selected list.
3) With the rule highlighted in the Selected list, click Rule Details.
a) Click Add in the Events window.
(1) Click in the Select event field and select Log error to repository.
(2) Click Continue.
(3) Click to move the two fields (PK, UPC) from Available fields to Selected fields.
(4) Click OK to close the Log Error To Repository Event Properties window.
b) Click OK to close the Events window.
4) Click OK to close the New Task window.
e. Right-click the Sources folder and select New Source.
1) Type Breakfast Items in the Name field.
2) Type Breakfast Items Table in the Description field.
3) Click OK to close the New Source window.
f. If necessary, expand the Tasks folder.
g. Right-click the UPC Check task and select Export To Data Job.
1) Verify that Data Source is selected.
2) Click Next.
a) Click next to the Input table field.
(1) Expand the dfConglomerate Grocery data source.
(2) Click the BREAKFAST_ITEMS table.
(3) Click OK to close the Select Table window.
b) Click to move all the selected columns from the list.
c) Double-click ID to move it from the Available list to the Selected list.
d) Double-click UPC to move it from the Available list to the Selected list.
3) Click Next.
4) Click in the Data Source Field field for the PK task field and select ID.
5) Click Next.
6) Type Check UPC Field in the Description field.
7) Click next to the Job name field.
a) Type Ch5E3_UPCCheck in the Name field.
b) If necessary, select Basics Exercises as the value for the Save in field.
c) Double-click the batch_jobs folder.
d) Click Save.
8) Click in the Source field and select Breakfast Items.
9) Click Finish.
h. Review the data job.
1) Click the Home tab.
2) Verify that the Folders riser bar is selected.
3) Expand Basics Exercises → batch_jobs.
4) Double-click the Ch5E3_UPCCheck data job.
5) If necessary, adjust the view to see the entire data job flow.
i. Select Actions → Run Data Job.
j. When the job completes, click the Log tab.
QUESTION: How many records were monitored?
Answer: 2617
QUESTION: How many records triggered the monitoring event?
Answer: 85
k. From the data job, select Tools → Monitoring Report → Basics Exercises.
l. Click the Trigger Values tab in the Details pane.
m. Click Close to close the Record Detail window.
n. Select File → Close to close the tab for the Monitor Viewer.
o. Select File → Close to close the tab for the data job.
p. Select File → Close to close the Business Rules Manager tab.

4. Creating a Monitoring Job for a Group-Based Rule
a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Select Tools → Business Rules Manager → Basics Exercises.
d. Create a new field by right-clicking the Fields folder and selecting New Field.
1) Type BRAND in the Name field.
2) Type Brand Name in the Description field.
3) Click OK.
e. Create a new group-based rule.
1) Right-click the Rules folder and select New Rule.
2) Type Low Brand Count in the Name field.
3) Click Group in the Type area.
4) Click Group By.
a) Double-click BRAND to move it to the Selected list in the Group By window.
b) Click OK to close the Group By window.
5) Under Definition, click Builder on the Checks tab.
6) Click Compare a field to a value or to another field under Compare Entire Field in the Step 1 area.
7) Click to move the check to the Step 2 area.
8) Click the Field is equal to Value link. The Compare To Value Or Field window appears.
a) Click next to the Field field. The Select Field window appears.
b) Click to expand the BRAND field.
c) Click to expand the Standard folder.
d) Click COUNT.
e) Click OK to close the Select Field window.
9) Click in the Check field and select less than.
10) Type 5 (five) in the Value field.
11) Click OK to close the Compare To Value Or Field window.
12) Click OK to close the Expression Builder window.
13) Click OK to close the New Rule window.
f. Right-click the Tasks folder and select New Task.
1) Type Low Brand Count in the Name field.
2) Double-click the Low Brand Count rule to move it from the Available list to the Selected list.
3) With the rule highlighted in the Selected list, click Rule Details.
a) Click Add in the Events window.
(1) Click in the Select event field and select Log error to repository.
(2) Click Continue.
(3) Double-click BRAND to move it from Available fields to Selected fields.
(4) Double-click SM$BRAND$COUNT to move it from Available fields to Selected fields.
(5) Click OK to close the Log Error To Repository Event Properties window.
b) Click OK to close the Events window.
4) Click OK to close the New Task window.
g. Right-click the Low Brand Count task and select Export To Data Job.
1) Verify that Data Source is selected.
2) Click Next.
3) Click next to the Input table field.
a) Expand the dfConglomerate Grocery data source.
b) Click the BREAKFAST_ITEMS table.
c) Click OK to close the Select Table window.
4) Click to remove all selected fields.
5) Double-click BRAND to move it from the Available list to the Selected list.
6) Click Next.
7) Verify that the BRAND field from the data source is mapped to the task field of the same name.
8) Click Next.
9) Type Low Brand Count Group Rule in the Description field.
10) Click next to the Job name field.
a) Type Ch5E4_LowBrandCount in the Name field.
b) If necessary, select Basics Exercises as the value for the Save in field.
c) Double-click the batch_jobs folder.
d) Click Save.
11) Click in the Source field and select Breakfast Items.
12) Click Finish.
h. Click the Home tab.
i. Verify that the Folders riser bar is selected.
j. Expand Basics Exercises → batch_jobs.
k. Double-click the Ch5E4_LowBrandCount data job.
l. If necessary, adjust the view to see the entire data job flow.
m. Select Actions → Run Data Job.
n. When the job completes, click the Log tab. A message was written to the log for each violation that was encountered.
QUESTION: How many aggregate records were monitored?
Answer: 255
QUESTION: How many records triggered the monitoring event?
Answer: 138
o. From the data job, select Tools → Monitoring Report → Basics Exercises.
p. Click the Trigger Values tab in the Details pane.
q. Click Close to close the Record Detail window.
r. Select File → Close to close the tab for the Monitor Viewer.
s. Select File → Close to close the tab for the data job.
t. Select File → Close to close the Business Rules Manager tab.