DataFlux Data
Management Studio:
Basics
Course Notes
DataFlux Data Management Studio: Basics Course Notes was developed by Kari Richardson. Editing
and production support was provided by the Curriculum Development and Support Department.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are trademarks of their respective companies.
DataFlux Data Management Studio: Basics Course Notes
Copyright © 2012 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of
America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written
permission of the publisher, SAS Institute Inc.
Book code E2229, course code DQ22DMP1, prepared date 09May2012.
DQ22DMP1_001
ISBN 978-1-61290-309-5
For Your Information
Table of Contents
Course Description ...................................................................................................................... vi
Prerequisites ............................................................................................................................... vii
Chapter 1  Introduction to DataFlux Methodology and Course Flow ................. 1-1
1.1  Introduction ...................................................................................................................... 1-3
1.2  Course Flow ..................................................................................................................... 1-6
Chapter 2  DataFlux Data Management Studio: Getting Started ......................... 2-1
2.1  Introduction ...................................................................................................................... 2-3
Demonstration: Navigating the DataFlux Data Management Studio Interface......... 2-6
Demonstration: Creating a DataFlux Data Management Studio Repository .......... 2-20
Exercises.................................................................................................................. 2-26
2.2  Quality Knowledge Base and Reference Sources .......................................................... 2-27
Demonstration: Verifying the Course QKB and Reference Sources ....................... 2-31
2.3  Data Connections ........................................................................................................... 2-33
Demonstration: Working with Data Connections ................................................... 2-36
Exercises.................................................................................................................. 2-53
2.4  Solutions to Exercises .................................................................................................... 2-54
Chapter 3  PLAN ...................................................................................................... 3-1
3.1  Creating Data Collections ................................................................................................ 3-3
Demonstration: Creating Collections of Address and Note Fields ........................... 3-5
Exercises.................................................................................................................. 3-13
Demonstration: Creating and Exploring a Data Exploration ................................. 3-18
3.3  Creating Data Profiles .................................................................................................... 3-32
Demonstration: Creating and Exploring a Data Profile .......................................... 3-39
Exercises.................................................................................................................. 3-53
Demonstration: Profiling Data Using Text File Input ............................................ 3-56
Demonstration: Profiling Data Using Filtered Data and an SQL Query ................ 3-61
Demonstration: Profiling Data Using a Collection ................................................ 3-68
Exercises.................................................................................................................. 3-70
Demonstration: Data Profiling – Additional Analysis (Self-Study) ........................ 3-73
3.4  Designing Data Standardization Schemes ..................................................................... 3-83
Demonstration: Creating a Phrase Standardization Scheme .................................. 3-85
Demonstration: Creating an Element Standardization Scheme .............................. 3-92
Demonstration: Importing a Scheme from a Text File ........................................... 3-96
Exercises.................................................................................................................. 3-98
3.5  Solutions to Exercises .................................................................................................... 3-99
Chapter 4  ACT ........................................................................................................ 4-1
4.1  Introduction to Data Jobs ................................................................................................. 4-3
Demonstration: Setting DataFlux Data Management Studio Options ..................... 4-5
Demonstration: Creating and Running a Simple Data Job ..................................... 4-10
Exercises.................................................................................................................. 4-33
4.2  Data Quality Jobs ........................................................................................................... 4-34
Demonstration: Investigating Standardizing, Parsing, and Casing ......................... 4-38
Demonstration: Investigating Right Fielding and Identification Analysis ............. 4-50
Exercises.................................................................................................................. 4-61
Demonstration: Using a Standardization Definition and a Standardization Scheme .......................................................................................... 4-63
4.3  Data Enrichment Jobs (Self-Study)................................................................................ 4-75
Demonstration: Performing Address Verification and Geocoding .......................... 4-77
Exercises.................................................................................................................. 4-88
4.4  Entity Resolution Jobs ................................................................................................... 4-90
Demonstration: Creating a Data Job to Cluster Records......................................... 4-93
Exercises................................................................................................................ 4-108
Demonstration: Creating an Entity Resolution Job ............................................... 4-112
Demonstration: Creating a Data Job to Compare Clusters (Optional) .................. 4-138
4.5  Multi-Input/Multi-Output Data Jobs (Self-Study) ....................................................... 4-147
Demonstration: Multi-Input/Multi-Output Data Job: New Products .................... 4-149
Demonstration: Multi-Input/Multi-Output Data Job: Customer Matches ............. 4-159
Exercises................................................................................................................ 4-166
Chapter 5  MONITOR............................................................................................... 5-1
5.1  Business Rules ................................................................................................................. 5-3
Demonstration: Creating a Row-Based Business Rule ............................................. 5-6
Exercises.................................................................................................................. 5-11
Demonstration: Adding a Business Rule and an Alert to a Profile ......................... 5-15
Demonstration: Performing a Historical Visualization ........................................... 5-21
Exercises.................................................................................................................. 5-25
Demonstration: Creating a Monitoring Job for a Row-Based Rule ........................ 5-29
Demonstration: Creating Another Data Job Calling the Same Task ....................... 5-38
Exercises.................................................................................................................. 5-41
Demonstration: Creating a Monitoring Job for a Set-Based Rule .......................... 5-42
Exercises.................................................................................................................. 5-48
Demonstration: Viewing the Monitor Dashboard ................................................... 5-50
Course Description
This course is for data quality stewards who perform data management tasks such as data quality, data enrichment, and entity resolution.
To learn more…
For information about other courses in the curriculum, contact the SAS
Education Division at 1-800-333-7660, or send e-mail to [email protected]
You can also find this information on the Web at support.sas.com/training/
as well as in the Training Course Catalog.
For a list of other SAS books that relate to the topics covered in this
Course Notes, USA customers can contact our SAS Publishing Department
at 1-800-727-3228 or send e-mail to [email protected] Customers outside
the USA, please contact your local SAS office.
Also, see the Publications Catalog on the Web at support.sas.com/pubs for
a complete list of books and a convenient order form.
Prerequisites
There currently are no prerequisites for this course.
Chapter 1 Introduction to DataFlux
Methodology and Course Flow
1.1  Introduction..................................................................................................................... 1-3
1.2  Course Flow .................................................................................................................... 1-6
1.1 Introduction
Objectives
 Provide an overview of DataFlux.
 Explore the DataFlux Data Management methodology.
Introduction to DataFlux
DataFlux is a leading provider of data management
solutions.
 Founded in 1997
 Acquired by SAS in 2000
 Operates as a wholly owned subsidiary of SAS
SAS is the world’s largest privately held software
company.
 Market leader in analytics and business intelligence
solutions
 DataFlux technology is embedded in several SAS
applications.
DataFlux Data Management Methodology
The methodology diagram depicts a cycle of three phases: PLAN (DEFINE and DISCOVER), ACT (DESIGN and EXECUTE), and MONITOR (EVALUATE and CONTROL).
The DataFlux Data Management Methodology is a step-by-step process for performing data management
tasks, such as data quality, data integration, data migrations, and MDM (Master Data Management).
When organizations plan, take action on, and monitor data management projects, they build the
foundation to optimize revenue, control costs, and mitigate risks. No matter what type of data you
manage, DataFlux technology can help you gain a more complete view of corporate information.
PLAN
DEFINE
The planning stage of any data management project starts with this essential first step.
This is where the people, processes, technologies, and data sources are defined.
Roadmaps are built that articulate the acceptable outcomes. Finally, cross-functional teams that span business units and bridge the business and IT communities are created to define the data management business rules.
DISCOVER
A quick inspection of your corporate data would probably find that it resides in many
different databases. The data is managed by many different systems, with many
different formats and representations of the same data. This step of the methodology
enables you to explore metadata to verify that the right data sources are included
in the data management program. It also enables you to create detailed data profiles
of identified data sources to understand their strengths and weaknesses.
ACT
DESIGN
After you complete the first two steps, this phase enables you to take the different
structures, formats, data sources, and data feeds, and create an environment that
accommodates the needs of your business. At this step, business and IT users build
jobs to enforce business rules for data quality and data integration. They create data
models to house data in consolidated or master data sources.
EXECUTE
After business users establish how the data and rules should be defined, the IT staff can
install them in the IT infrastructure and determine the integration method – real time,
batch, or virtual. These business rules can be reused and redeployed across
applications, which helps increase data consistency in the enterprise.
MONITOR
EVALUATE
This step of the methodology enables users to define and enforce business rules
to measure the consistency, accuracy, and reliability of new data as it enters the
enterprise. Reports and dashboards about critical data metrics are created for business
and IT staff members. The information gained from data monitoring reports is used
to refine and adjust the business rules.
CONTROL
The final stage in a data management project involves examining any trends to validate
the extended use and retention of the data. Data that is no longer useful is retired.
The project’s success can then be shared throughout the organization. The next steps
are communicated to the data management team to lay the groundwork for future data
management efforts.
1.2 Course Flow
Objectives
 Review the steps for the course.
 Provide an overview of the course data.
Course Tasks – Step 1
Step 1: DataFlux Data Management Studio Basics
 Explore the DataFlux Data Management Studio interface
 Create course repositories
 Define data connections
This step involves becoming familiar with the DataFlux Data Management Studio interface. As part of the
exploration, you define a repository and verify the course QKB (Quality Knowledge Base). In addition, a
new data connection is defined.
Course Tasks – Step 2
Step 2: PLAN: Connect to, explore, and profile data; design standardization schemes.
 Explore data values in a data profile
 Explore data information in a field relationship map
 Build a scheme
The PLAN phase of the DataFlux Data Management Methodology involves exploring the structural information in the data sources and analyzing their data values. As a result of that analysis, this phase builds any needed schemes.
Course Tasks – Step 3
Step 3: ACT: Build data jobs for quality and entity resolution.
The analysis and exploration of the data sources can lead to the discovery of data quality issues. The ACT
phase is designed to create the data jobs that cleanse or correct the data. The steps involved do the following:
• Standardize, parse, and/or case the data
• Correctly identify types of data (identification analysis)
• Perform methods to remove duplicates from data sources, or to join tables with no common key
Course Tasks – Step 4
Step 4: MONITOR: Create and implement various types of business rules.
After data corrections are applied, a final phase in the DataFlux Data Management Methodology
is to set up processes to monitor various aspects of the data.
Course Data
dfConglomerate has many lines of business (Gifts, Grocery, Clothing, Auto) and is in need of an enterprise-level view of its data. This course works with two lines of data:
 dfConglomerate Gift (used in demonstrations)
 dfConglomerate Grocery (used in exercises)
Founded in 1926, dfConglomerate pursued a policy of managed growth and acquisitions to transform
itself into an organization with world-wide operations and more than 30,000 employees. Headquartered
in Toledo, Ohio, dfConglomerate has offices in London, Frankfurt, Rio de Janeiro, Tokyo, and Bangalore.
Because dfConglomerate operates across dozens of different lines of business, the reach of its products
and services encompasses everything from heavy mining and manufacturing to automobiles and
consumer goods. dfConglomerate prides itself on its decentralized management structure, with each
division having a high degree of autonomy over its own operations. Recently, corporate level executives
realized that the company’s history of mergers and acquisitions endowed it with a legacy of data silos,
incompatible systems, and bad and duplicate data. It became apparent that the company lacked the
essential enterprise-level view of its data. It also realized that to achieve this view, the company needed
to improve the overall quality of its data. To achieve this, dfConglomerate began establishing data
management centers of excellence, and used these as pilot programs with the intent of replicating them
across the enterprise.
There are five tables that exist in the dfConglomerate Gifts Microsoft Access database.
There are two tables that exist in the dfConglomerate Grocery Microsoft Access database.
Chapter 2 DataFlux Data
Management Studio: Getting Started
2.1
Introduction..................................................................................................................... 2-3
Demonstration: Navigating the DataFlux Data Management Studio Interface ...................... 2-6
Demonstration: Creating a DataFlux Data Management Studio Repository ....................... 2-20
Exercises .............................................................................................................................. 2-26
2.2
Quality Knowledge Base and Reference Sources ..................................................... 2-27
Demonstration: Verifying the Course QKB and Reference Sources ................................... 2-31
2.3
Data Connections ......................................................................................................... 2-33
Demonstration: Working with Data Connections ................................................................. 2-36
Exercises .............................................................................................................................. 2-53
2.4
Solutions to Exercises ................................................................................................. 2-54
2.1 Introduction
Objectives
 Provide an overview of the components of the DataFlux Data Management Platform.
 Define key components and terms of DataFlux Data Management Studio.
 Define a DataFlux Data Management Studio repository.
DataFlux Data Management Platform
The platform diagram shows the DataFlux Data Management Studio desktop client (Windows), which supports the PLAN, ACT, and MONITOR phases, working with the DataFlux Data Management Server (real-time/transactional and bulk/batch processing), the DataFlux Federation Server (virtual access), and the DataFlux Authentication Server to deliver on-demand data integration.
DataFlux Data Management Studio provides a single interface for both business and IT users. Built on
DataFlux's technology foundation, DataFlux Data Management Studio provides a unified development
and delivery environment, giving organizations a single point of control to manage data quality, data
integration, master data management (MDM), or any other enterprise data initiative. DataFlux Data
Management Studio enables top-down design of business process and rules, as well as bottom-up
development and delivery of specific components in a single user environment. Building jobs and
business rules in DataFlux Data Management Studio enables transparency in the design process,
collaborative development, and change management during and after the deployment.
Working in unison with DataFlux Data Management Studio, DataFlux Data Management Server helps
form the backbone of the DataFlux Data Management Platform. The technology can implement the rules
created in DataFlux Data Management Studio in both batch and real-time environments. Through an
innovative, standards-based service-oriented architecture (SOA), DataFlux Data Management Server
frees you from writing and testing hundreds of lines of extra code to enable real-time data quality, data
integration, and data governance routines. DataFlux Data Management Server is the workhorse that
enables pervasive data quality, data integration, and master data management (MDM) throughout your
organization.
DataFlux provides integrated views of data located in multiple repositories without the need to physically
consolidate the data. This virtual integration enables powerful querying capabilities, as well as improved
source data management. Data federation efficiently joins data from many heterogeneous sources, without
moving or copying the data. The result is improved performance and speed, while reducing costly and
labor-intensive data movement and consolidation. DataFlux Federation Server enables organizations to
view data from multiple sources through an integrated, virtual data view. While the data remains stored in
its source application, multiple users can view integrated data without physically moving or changing
data. The data appears as if it were physically integrated in one place, either as a table, file, or Web call.
The DataFlux Authentication Server acts as a central point of security for users of the DataFlux Data
Management Platform. By providing authentication services, the Authentication Server helps the users of
DataFlux Data Management Studio securely connect to the DataFlux Data Management Server and/or
database servers across your enterprise.
DataFlux Data Management Terminology
The Home tab includes the main menu, a toolbar, the navigation pane, the navigation riser bars, and the information pane.
Other primary tabs provide a secondary toolbar, secondary tabs, resource panes, a work area, a details pane, and a control to detach the selected tab.
Navigating the DataFlux Data Management Studio Interface
In this demonstration you explore and navigate the interface for DataFlux Data Management Studio.
1. Select Start  All Programs  DataFlux  Data Management Studio 2.2.
A splash screen appears as DataFlux Data Management Studio is initialized.
An interactive Help window for the DataFlux Data Management Methodology also appears by
default.
2. Clear the Display Data Management Methodology on Startup check box. (This check box appears
in the right of the bottom panel of this interactive Help window.)
 The DataFlux Data Management Methodology Help can be redisplayed in DataFlux Data Management Studio by clicking anywhere in the Data Management Methodology portlet.
Data Management Methodology Portlet
3. Click the close button to close the DataFlux Data Management Methodology Help window.
The left side of the interface displays navigation information that includes a navigation pane and an area of navigation riser bars. The right side of the interface displays the information pane for the selected element of the navigation riser bars. The main menu and a toolbar are also on the Home tab.
4. Verify that the main menu and toolbar appear on the Home tab.
5. Verify that the left side of the interface displays navigation information (the navigation pane and the navigation riser bars).
6. Verify that the right side of the interface displays the information pane.
Notice the following:
• The Information riser bar is selected in the Navigation Riser Bars area.
• The Navigation pane displays the items for information.
• The Overview item is selected (in the Navigation pane).
• The Information pane displays the overview information.
• The Overview information area consists of six portlets.
7. Verify the view of the Overview information area and its six portlets: Recent Files, Methodology, Data Roundtable Discussions, Documentation, Settings, and Links.
8. Click anywhere in the DataFlux Data Management Methodology portlet.
The DataFlux Data Management Methodology window appears.
a. Click the DEFINE component of the methodology.
Information about steps to perform in the Define portion of the methodology appears.
There are four main items in the Define portion of the methodology (Connect to Data, Explore
Data, Define Business Rules, Build Schemes). Selecting any of these items provides a brief
overview and a link to more in-depth help.
b. Click Explore Data.
c. Click Click here to learn more.
The online Help for DataFlux Data Management Studio appears for the topic of Data
Explorations.
d. Click the close button to close the Help window.
e. Click the close button to close the DataFlux Data Management Methodology Help window.
9. In the Navigation area, click the Data riser bar.
The Data riser bar enables a user to work with collections, data connections, and master data foundations.
Collections folder A collection is a set of fields that are selected from tables that
are accessed from different data connections. A collection
provides a convenient way for a user to build a data set using
those fields. A collection can be used as an input source for
other components in Data Management Studio, such as the
Data Viewer, profiles, queries, and data explorations.
Data Connections folder Data connections are used to access data in jobs, profiles,
data explorations and data collections.
Master Data Foundations folder The Master Data Foundation feature in Data Management
Studio uses master data projects and entity definitions to
develop the best possible record for a specific resource, such
as a customer or a product, from all of the source systems that
might contain a reference to that resource.
10. Click Data Connections in the Navigation pane.
Information area
displaying
information about
data connections
The information area provides overview information about all defined data connections such as
names, descriptions, and types.
Different types of data connections can be made in DataFlux Data Management Studio:
• ODBC connections
• Federated Server connections
• Localized DSN connections
• Custom connections
• SAS Data Set connections
ODBC Connection
Displays the Microsoft Windows ODBC Data Source Administrator dialog,
which you can use to create ODBC connections
Federated Server
Connection
Enables you to create a federated server data connection. A federated
connection enables you to establish a threaded connection to a database
Localized DSN
Connection
Enables you to create a localized DSN connection definition for a specific data
source that is available via the federated server. This connection definition is
used when you access federated tables. These localized connections are
federated server connections to a DBMS that are named and created as an extra
connection to the same source in metadata
Custom
Connection
Enables you to create a custom connection string for non-ODBC connection
types. These custom strings are useful for establishing native connections to
third-party databases or drawing data from more than one type of data input
SAS Data Set
Connection
Enables you to create SAS data set connections
It is a best practice to save user credentials for your Data Connections.
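All of these connection types ultimately hand a connection string to the underlying driver. As a rough illustration of the generic ODBC key=value convention (this is plain Python, not a DataFlux API; the DSN name and the helper function are hypothetical), a string for a DSN-based source can be assembled like this:

```python
def odbc_connection_string(dsn, uid=None, pwd=None, **extras):
    """Assemble an ODBC-style key=value;... connection string."""
    parts = [f"DSN={dsn}"]
    if uid is not None:
        parts.append(f"UID={uid}")
    if pwd is not None:
        parts.append(f"PWD={pwd}")
    # Any extra driver keywords (e.g. Database=...) are appended as given.
    parts.extend(f"{key}={value}" for key, value in extras.items())
    return ";".join(parts)

# For example, a DSN-based source with a saved user ID:
print(odbc_connection_string("DataFlux Sample", uid="student"))
# DSN=DataFlux Sample;UID=student
```

The UID and PWD keywords correspond to the saved user credentials mentioned in the note above.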
11. Click the expand icon to expand the Data Connections item.
12. Click the DataFlux Sample data connection in the Navigation pane.
The tables accessible through this data connection are revealed in the information area.
13. In the Navigation area, click the Data Management Servers riser bar.
In this Navigation pane, you can define new server instances or import jobs to process on the server.
a. Click the expand icon in front of Data Management Servers.
b. Click DataFlux Training Server.
The information area displays specifics about this defined server.
14. In the Navigation area, click the Administration riser bar.
The Administration navigation pane enables you to manage various items such as repositories, the
Quality Knowledge Base (QKB), reference sources (data packs), and macro files.
What Is a Repository?
A Data Management Studio repository is a collection of data and metadata around DataFlux objects.
DataFlux Data Management Studio repositories are used to do the following:
 Organize work
 Show lineage
The repositories consist of two parts:
 Data storage
 File storage
The data storage portion of a repository can contain the following:
• Explorations and reports
• Profiles and reports
• Business rules
• Monitoring results
• Custom metrics
• Business Data information
• Master Data information
The file storage portion of a repository can contain the following:
• Data jobs
• Process jobs
• Match reports
• Entity resolution files
• Queries
• Entity Definitions
• Other files
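Using the paths from the upcoming demonstration, the two parts map onto disk roughly as follows (an illustrative sketch; actual repository contents vary):

```
S:\Workshop\dqdmp1\Demos\
    Demos.rps      data storage: the repository database file
    files\         file storage: data jobs, process jobs, queries, other files
```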
The Administration riser bar, with the Repository Definitions item selected, displays all currently defined repository definitions.
Creating a DataFlux Data Management Studio Repository
This demonstration illustrates the steps necessary to create a DataFlux Data Management Studio
repository.
1. If necessary, access the Administration riser bar.
a. Select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Administration riser bar.
The Navigation pane shows administration
items when the Administration riser bar is
selected.
2. Click Repository Definitions in the list of Administration items on the Navigation pane.
The information pane displays overall repository information.
3. Click New to define a new repository.
a. Type Basics Demos in the Name field.
b. Click Browse next to the Database file field.
1) In the Open window, navigate to S:\Workshop\dqdmp1.
2) Click New folder.
3) Type Demos as the value for the new folder name.
4) Press ENTER.
5) Double-click the new folder (Demos).
6) Type Demos.rps in the File name field.
7) Click Open.
c. Click Browse next to the Folder field in the File storage area.
1) Navigate to S:\Workshop\dqdmp1\Demos.
2) Click Make New Folder.
3) Type files as the value for the name of the new folder.
4) Press ENTER.
5) Click OK.
d. Clear Private.
The final settings for the new repository definition should resemble the following:
e. Click OK.
A message window appears and states that the repository does not exist.
f. Click Yes.
Information regarding the repository initialization appears in a window.
g. Click Close.
The repository is created and connected.
4. Update the set of default folders for this new repository.
a. Verify that the Home tab is selected.
b. Click the Folders riser bar.
c. Click the expand icon to expand the Basics Demos repository.
d. Select File  New  Folder.
1) Type output_files.
2) Press ENTER.
e. Click the Basics Demos repository.
f. Select File  New  Folder.
1) Type profiles_and_explorations.
2) Press ENTER.
Note: It is a best practice to use all lowercase and no spaces for folders that might be used on the DataFlux Data Management Server, because some server operating systems are case sensitive. Because profiles and explorations run on a DataFlux Data Management Server, you want to adhere to this best practice.
The final set of folders for the Basics Demos repository should resemble the following:
Exercises
1. Creating a Repository for Upcoming Exercises
Create a new repository to be used for the items created in the upcoming exercises. Some specifics for
the new repository follow:
• Name the repository Basics Exercises.
• Create a database file repository named S:\Workshop\dqdmp1\Exercises\Exercises.rps.
• Specify a file storage location of S:\Workshop\dqdmp1\Exercises\files.
2. Updating the Set of Default Folders for the Basics Exercises Repository
Create two additional folders for the new Basics Exercises repository:
• output_files
• profiles_and_explorations
The final set of folders should resemble the following:
2.2 Quality Knowledge Base and Reference Sources
Objectives
• Define a Quality Knowledge Base (QKB).
• Define reference sources.
What Is a Quality Knowledge Base (QKB)?
Quality Knowledge Base: A collection of files that perform the “cleansing” of data, which can involve parsing, standardization, matching, and more.

DataFlux Locale Support
The Quality Knowledge Base (QKB) is a collection of files and configuration settings that contain all the
DataFlux data management algorithms. DataFlux is Unicode compliant, which means that it can read data
from any language. However, DataFlux only has definitions in the Quality Knowledge Base (QKB) to
standardize, identify, fuzzy match, and so on for data from particular locales. Each QKB supports data
management operations for a specific business area. The QKB for Contact Information supports
management of commonly used contact information for individuals and organizations — for example,
names, addresses, company names, and phone numbers. Locales are continuously added, so contact your DataFlux representative about future locale support.
In addition, DataFlux Data Management Studio can perform cleansing functions for non-contact-information data. Depending on the activities that you want to perform, you might need to create your own definitions in the DataFlux Quality Knowledge Base using the Data Management Studio – Customize module. This module enables you to modify existing definitions or create new definitions in the QKB. (The Data Management Studio – Customize module has its own training course.)
Quality Knowledge Base locations are registered on the Administration riser bar in DataFlux Data Management Studio. One QKB location must be designated as the default QKB location.
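To make the idea of a QKB definition concrete, the following minimal sketch shows what a standardization definition conceptually does. This is plain Python, not the DataFlux QKB API; the scheme entries and the `standardize` function are invented for illustration only.

```python
# Conceptual sketch: a standardization definition maps raw data values to
# standard forms via a lookup scheme. These entries are invented examples,
# not actual QKB content.
SCHEME = {
    "ST": "STREET", "ST.": "STREET",
    "AVE": "AVENUE", "AVE.": "AVENUE",
    "N.": "NORTH",
}

def standardize(value):
    # Uppercase the value, then replace each token found in the scheme.
    tokens = value.upper().split()
    return " ".join(SCHEME.get(t, t) for t in tokens)

print(standardize("123 n. Main St."))  # 123 NORTH MAIN STREET
```

A real QKB definition does far more (parsing, locale-specific rules, fuzzy matching), but the lookup-and-replace idea is the core of a standardization scheme.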
Summary information for the selected QKB appears when a specific QKB is selected on the Administration riser bar.
What Is a Reference Source?
Reference Source: An external source of information. A reference object is typically a database used to compare user data.

Current valid reference data includes the following:
• USPS data
• Canada Post data
• Geo+Phone data
• World data
A reference object is typically a database used by DataFlux Data Management Studio to compare user
data to a reference source (for example, USPS Address Data). You cannot directly access or modify
references.
Reference source locations are registered on the Administration riser bar in DataFlux Data Management
Studio. One reference source location of each type should be designated as the default for that type.
Summary information for the defined reference sources appears when the Reference Sources item is selected on the Administration riser bar.
Verifying the Course QKB and Reference Sources
This demonstration illustrates the verification of the course QKB, as well as the reference sources that can
be defined on the course image.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Administration riser bar.
4. Expand Quality Knowledge Bases.
5. Click the DataFlux QKB CI 2011A quality knowledge base.
The information area displays a summary of information.
6. Expand Reference Sources.
Two reference sources are defined.
7. Click the USPS Data item.
The information area displays a summary of information.
8. Click the Geo+Phone Data item.
The information area displays a summary of information.
2.3 Data Connections
Objectives
• Explore data connections.
What Is a Data Connection?

Data Connection: Enables access to data in DataFlux Data Management Studio.

DataFlux Data Management Studio enables five types of data connections:
• ODBC Connection
• Federated Server Connection
• Localized DSN Connection
• Custom Connection
• SAS Data Set Connection
Data connections enable you to access your data in Data Management Studio from many types of data
sources.
• ODBC Connection – Displays the Microsoft Windows ODBC Data Source Administrator dialog box,
which you can use to create ODBC connections.
• Federated Server Connection – Enables you to create a federated server data connection. A federated
connection enables you to establish a threaded connection to a database.
• Localized DSN Connection – Enables you to create a localized DSN connection definition for a
specific data source that is available via the federated server. This connection definition is used when
you access federated tables. These localized connections are federated server connections to a DBMS.
They are named and created as an extra connection to the same source in metadata.
• Custom Connection – Enables you to create a custom connection string for non-ODBC connection
types. These custom strings are useful for establishing native connections to third-party databases or
drawing data from more than one type of data input.
• SAS Data Set Connection – Enables you to create SAS data set connections.
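For the ODBC Connection type, the DSN that you define ultimately resolves to a standard ODBC connection string. The sketch below builds such a string in Python; the helper function and the database name are invented placeholders (the server name and credentials match the classroom values used in the demonstration that follows), and no actual connection is made.

```python
# Hedged sketch of the standard ODBC connection-string format that a DSN
# resolves to. DATABASE is a placeholder value, not a real course database.
def odbc_connection_string(driver, server, database, uid, pwd):
    parts = {
        "DRIVER": "{" + driver + "}",
        "SERVER": server,
        "DATABASE": database,
        "UID": uid,
        "PWD": pwd,
    }
    return ";".join("%s=%s" % (k, v) for k, v in parts.items())

conn_str = odbc_connection_string("SQL Server", r"SASBAP\SQLEXPRESS",
                                  "placeholder_db", "sa", "sas")
print(conn_str)
# DRIVER={SQL Server};SERVER=SASBAP\SQLEXPRESS;DATABASE=placeholder_db;UID=sa;PWD=sas
```

The ODBC Data Source Administrator dialog box used in the next demonstration collects these same values through its wizard pages instead of requiring you to type the string by hand.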
Summary information for the defined data connections appears when the Data Connections item is selected on the Data riser bar. The data viewer can then be used to examine the data in a selected table.
Working with Data Connections
This demonstration illustrates how to define and work with a data connection, including using the data
viewer and generating queries.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Data riser bar.
4. Expand Data Connections.
5. Define a new data connection.
a. Click the Data Connections item in the navigation area.
b. In the information area for the selected item, click New Data Connection.
c. Click ODBC Connection.
The ODBC Data Source Administrator window appears.
d. Click the System DSN tab.
e. Click Add.
f. Click the SQL Server driver.
g. Click Finish to close the Create New Data Source window.
The Create a New Data Source to SQL Server window appears.
2.3 Data Connections
h. Type DataFlux Training in the Name field.
i. Select SASBAP\SQLEXPRESS as the value for the Server field.
j. Click Next.
1) Click With SQL Server authentication using a login ID and password entered by the
user.
2) Type sa in the Login ID field.
3) Type sas in the Password field.
k. Click Next. No changes are needed.
l. Click Next. No changes are needed.
m. Click Finish.
A summary of the settings appears.
n. Click Test Data Source.
o. Click OK to close the test window.
p. Click OK to close the setup window.
The new data source appears on the System DSN tab of the ODBC Data Source Administrator
window.
q. Click OK to close the ODBC Data Source Administrator window.
The new data source appears in the Data Connection navigation area.
6. Investigate the new data connection named DataFlux Training.
a. Expand DataFlux Training.
The dbo, INFORMATION_SCHEMA, and sys entries are structures created and maintained by SQL Server. Only valid, usable data sources should be displayed in DataFlux Data Management Studio.
b. Right-click DataFlux Training and select Filter.
1) Click Local filter.
2) Click Schema name.
3) Type df_gifts, df_grocery in the Name field.
4) Click OK.
Only the two valid, usable data sources are displayed.
7. If necessary, save the connection information for the two dfConglomerate data connections.
a. Click the Data Connections item in the navigation pane.
The information area displays information for all data connections.
b. Right-click dfConglomerate Gifts and select Save User Credentials.
c. Right-click dfConglomerate Grocery and select Save User Credentials.
Note: This action saves the data source connection information. If the data source connection requires a username and password, you are prompted for that information. It is a recommended best practice to save connection information.
8. Use the data viewer to explore data.
a. In the Data Connections navigation area, expand the dfConglomerate Gifts data connection.
b. Click the Customers table object.
The information area is populated with information about the selected table.
c. Click the Data tab.
d. Click the sort tool on the Data tab.
A Sort window appears.
1) Click the JOB TITLE field in the Available Fields pane.
2) Click the arrow to move the JOB TITLE field to the Selected Fields pane.
3) Select Descending in the Direction field.
4) Click OK.
The data is now displayed in sorted order (by descending JOB TITLE). A symbol in the column heading indicates that a sort occurred and shows the sort direction.
9. Click the Fields tab.
This tab can be used to examine the information displayed for each field in the selected table.
10. Create a stand-alone query using the Customers table.
a. In the Navigation area, right-click the Customers table and select Query Table.
b. Type Ch2D4_CustomersMissingEmail_Query in the Name field.
c. If necessary, navigate to Basics Demos  profiles_and_explorations in the Save in field.
d. Click OK.
The new query appears on a tab.
The query can be further customized in this interface, which includes the Query Flow tab, the Nodes and Data riser bars, the Query Flow Editor, the Details pane, and the SQL navigation panel.
11. Change the orientation of the flow by clicking the orientation tool and selecting Left To Right.
12. Add a WHERE clause to the generated SQL query.
a. Locate the Nodes riser bar.
b. Double-click the Where keyword under the Clauses grouping.
The flow resembles the following:
c. Right-click the newly added Where node and select Properties.
d. Click the selection tool in the first Operand field and select Select Field.
The Select Field window appears.
1) Expand the Customers table.
2) Click EMAIL.
The Select Field window displays a sample of the selected field’s values.
3) Click OK.
e. Click the selection tool in the Operator field and select IS NULL.
13. Select File  Save to save the query.
14. Select Actions  Preview to preview the result set.
The Details pane displays the Preview tab.
15. Click the Code tab on the Details pane to view the generated SQL code.
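The generated code for this query is ordinary SQL with an IS NULL filter. As a minimal sketch, an equivalent query can be run against an in-memory SQLite table with made-up sample rows (not the course's dfConglomerate data):

```python
# Sketch of the WHERE EMAIL IS NULL filter that the query builder
# generates, demonstrated with Python's built-in sqlite3 module and
# invented sample rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (NAME TEXT, EMAIL TEXT)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [("Ann", "ann@example.com"), ("Bob", None), ("Cam", None)])

# Rows whose EMAIL value is missing, like the Ch2D4 query's result set.
missing = conn.execute(
    "SELECT NAME FROM Customers WHERE EMAIL IS NULL ORDER BY NAME").fetchall()
print(missing)  # [('Bob',), ('Cam',)]
```

Note that IS NULL (rather than `= NULL`) is the correct SQL test for missing values, which is why the Operator field in the Where node offers it directly.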
16. Select View  Show Details Pane to toggle off the Details pane.
17. Click the Code sub-tab.
18. Select View  Editor Type  Code.
A Warning window appears.
19. Click Yes.
The editor changes from the Visual style to the Code style.
20. Expand the dfConglomerate Gifts data connection.
21. Click the Customers table.
22. Click the Functions riser bar.
Many types of functions are available to customize the SQL query further.
23. Select File  Close to close the query.
Exercises
3. Viewing and Graphing a Database Table
Access the BREAKFAST_ITEMS table in the dfConglomerate Grocery data connection.
• View the data on the Data tab and perform a sort by the columns BRAND and NAME.
• Review the field attributes on the Fields tab.
QUESTION: How many fields are in the BREAKFAST_ITEMS table?
Answer:
QUESTION: How many fields are character?
Answer:
QUESTION: How many fields are numeric?
Answer:
• Create a graph using the Graph tab with the following specifications:
Chart type: Area
X axis: NAME
Y axis: SIZE
QUESTION: What NAME value has the highest value (or sum of values) of SIZE for the sample defined by the default “row count range” of 30?
Answer:
2.4 Solutions to Exercises
1. Creating a Repository for Upcoming Exercises
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Administration riser bar.
d. Click Repository Definitions in the list of Administration items on the Navigation pane.
e. Click New to define a new repository.
1) Type Basics Exercises in the Name field.
2) Click Browse next to the Database file field.
a) In the Open window, navigate to S:\Workshop\dqdmp1.
b) Click New folder.
c) Type Exercises as the new folder name.
d) Press ENTER.
e) Double-click the new folder (Exercises).
f) Type Exercises.rps in the File name field.
g) Click Open.
3) Click Browse next to the Folder field in the File Storage area.
a) Navigate to S:\Workshop\dqdmp1\Exercises.
b) Click Make New Folder.
c) Type files as the name of the new folder.
d) Press ENTER.
e) Click OK.
4) Clear Private.
5) Click OK. A message window appears and states that the repository does not exist.
6) Click Yes.
Information regarding the repository initialization appears in a window:
7) Click Close.
The repository is created and connected:
2. Updating the Set of Default Folders for the Basics Exercises Repository
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Expand the Basics Exercises repository.
e. Select File  New  Folder.
1) Type output_files.
2) Press ENTER.
f. Click the Basics Exercises repository.
g. Select File  New  Folder.
1) Type profiles_and_explorations.
2) Press ENTER.
The final set of folders for the Basics Exercises repository should resemble the following:
3. Viewing and Graphing a Database Table
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Data riser bar.
d. Expand the Data Connections item.
e. Expand the dfConglomerate Grocery data connection.
f. Click the BREAKFAST_ITEMS table object.
The information area is populated with information about the selected table.
g. Click the Data tab.
h. Sort the data by BRAND and NAME.
1) Click the sort tool on the Data tab.
A Sort window appears.
2) Click the BRAND field in the Available Fields pane.
3) Click the arrow to move the BRAND field to the Selected Fields pane.
4) Click the NAME field in the Available Fields pane.
5) Click the arrow to move the NAME field to the Selected Fields pane.
6) Click OK.
The data is displayed in sorted order (by BRAND and then by NAME).
Note: The 1 designates a primary ascending sort. The 2 designates a secondary ascending sort.
i. Click the Fields tab.
QUESTION: How many fields are there in the BREAKFAST_ITEMS table?
Answer: Fifteen (15)
QUESTION: How many fields are character?
Answer: Seven (7)
QUESTION: How many fields are numeric?
Answer: Eight (8)
j. Click the Graph tab.
1) Click the selection tool in the Chart type field and select Area.
2) Click the selection tool in the X-axis field and select NAME.
3) Click the selection tool in the Y-axis field and select SIZE.
QUESTION: What NAME value has the highest value (or sum of values) of SIZE for the sample defined by the default “row count range” of 30?
Answer: MALT-O-MEAL HONEY BUZZERS (21.8)
Chapter 3 PLAN
3.1 Creating Data Collections .............................................................................................. 3-3
Demonstration: Creating Collections of Address and Note Fields ........................................ 3-5
Exercises .............................................................................................................................. 3-13
3.2 Designing Data Explorations
Demonstration: Creating and Exploring a Data Exploration ............................................... 3-18
3.3 Creating Data Profiles .................................................................................................. 3-32
Demonstration: Creating and Exploring a Data Profile ........................................................ 3-39
Exercises .............................................................................................................................. 3-53
Demonstration: Profiling Data Using Text File Input ........................................................... 3-56
Demonstration: Profiling Data Using Filtered Data and an SQL Query .............................. 3-61
Demonstration: Profiling Data Using a Collection ............................................................... 3-68
Exercises .............................................................................................................................. 3-70
Demonstration: Data Profiling – Additional Analysis (Self-Study) ........................................ 3-73
3.4 Designing Data Standardization Schemes ................................................................. 3-83
Demonstration: Creating a Phrase Standardization Scheme ............................................. 3-85
Demonstration: Creating an Element Standardization Scheme ......................................... 3-92
Demonstration: Importing a Scheme from a Text File ......................................................... 3-96
Exercises .............................................................................................................................. 3-98
3.5 Solutions to Exercises ................................................................................................. 3-99
3.1 Creating Data Collections
Objectives
• Define a data collection.
What Is a Data Collection?

Data Collection: A set of data fields from different tables of different data connections.

A data collection
• provides a convenient way to build a data source using desired fields
• can be used as an input source for Data Viewer, explorations, profiles, and queries.
On the Data riser bar, the Collections grouping lists the collections in the Basics Demos repository, and the selected collection's fields are displayed.
You can use data collections to group data fields from different tables and/or database connections.
These collections can be used as input sources for data explorations and profiles.
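Conceptually, a collection is just a named set of (connection, table, field) references. The following sketch models that idea; the `Collection` class is invented for illustration and is not a DataFlux API:

```python
# Conceptual model of a data collection: a named list of field references
# that can span tables and data connections.
from dataclasses import dataclass, field

@dataclass
class Collection:
    name: str
    fields: list = field(default_factory=list)

    def insert_field(self, connection, table, column):
        # Each entry records where the field lives, not a copy of its data.
        self.fields.append((connection, table, column))

address_info = Collection("Address Info")
address_info.insert_field("dfConglomerate Gifts", "Customers", "ADDRESS")
address_info.insert_field("dfConglomerate Grocery", "MANUFACTURERS",
                          "STREET_ADDR")
print(len(address_info.fields))  # 2
```

Because the collection stores references rather than data, the same fields can later feed explorations, profiles, and queries without duplicating the underlying tables.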
Creating Collections of Address and Note Fields
This demonstration illustrates the steps necessary to create a data collection. Two collections are built:
one with address information and one with notes.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Data riser bar.
4. Expand the Collections grouping.
5. Click the Basics Demos repository.
6. Create a new collection of address information.
a. Click the New Collection tool (or right-click Basics Demos and select New Collection).
b. Type Address Info in the Name field.
c. Type Collection of Address Fields in the Description field.
d. Click OK.
The new collection appears on a tab.
7. Insert various address fields.
a. Click the Insert Fields tool.
The Insert Fields window appears.
b. Add address fields from the dfConglomerate Gifts data connection.
1) Click the selection tool next to the Connection field.
2) Click dfConglomerate Gifts.
3) Click OK.
4) Add fields from the Customers table.
a) Expand the Customers table.
b) Click the ADDRESS field. (Click the check box.)
c) Click the ADDRESS field name.
A data sample of the ADDRESS field appears on the right of the Insert Fields window.
d) Click the CITY field. (Click the check box.)
e) Click the COUNTRY/REGION field. (Click the check box.)
f) Click the STATE/PROVINCE field. (Click the check box.)
g) Click the ZIP/POSTAL CODE field. (Click the check box.)
All the desired fields from the Customers table are selected.
h) Click Add.
i) Clear the selections for the Customers table.
5) Add fields from the Employees table.
a) Expand the Employees table.
b) Click the ADDRESS field. (Click the check box.)
c) Click the CITY field. (Click the check box.)
d) Click the COUNTRY/REGION field. (Click the check box.)
e) Click the STATE/PROVINCE field. (Click the check box.)
f) Click the ZIP/POSTAL CODE field. (Click the check box.)
g) Click Add.
h) Clear the selections for the Employees table.
6) Add fields from the Orders table.
a) Expand the Orders table.
b) Click the SHIP ADDRESS field. (Click the check box.)
c) Click the SHIP CITY field. (Click the check box.)
d) Click the SHIP COUNTRY/REGION field. (Click the check box.)
e) Click the SHIP STATE/PROVINCE field. (Click the check box.)
f) Click the SHIP ZIP/POSTAL CODE field. (Click the check box.)
g) Click Add.
h) Clear the selections for the Orders table.
c. Add address fields from the dfConglomerate Grocery data connection.
1) Click the selection tool next to the Connection field.
2) Click dfConglomerate Grocery.
3) Click OK.
4) Add fields from the MANUFACTURERS table.
a) Expand the MANUFACTURERS table.
b) Click the CITY field. (Click the check box.)
c) Click the CONTACT_ADDRESS field. (Click the check box.)
d) Click the CONTACT_CITY field. (Click the check box.)
e) Click the CONTACT_CNTRY field. (Click the check box.)
f) Click the CONTACT_POSTAL_CD field. (Click the check box.)
g) Click the CONTACT_STATE_PROV field. (Click the check box.)
h) Click the COUNTRY field. (Click the check box.)
i) Click the POSTAL CD field. (Click the check box.)
j) Click the STATE_PROV field. (Click the check box.)
k) Click the STREET_ADDR field. (Click the check box.)
l) Click Add.
5) Click Close to close the Insert Fields window.
The Address Info collection should now resemble the following:
8. Select File  Close Collection.
9. Create a new collection of note information.
a. Click the Basics Demos repository.
b. Click the New Collection tool (or right-click Basics Demos and select New Collection).
c. Type Descriptions in the Name field.
d. Type Collection of Descriptive Fields in the Description field.
e. Click OK.
The new collection appears on a tab.
10. Insert various notes fields.
a. Click the Insert Fields tool.
The Insert Fields window appears.
b. Click the newly defined Descriptions collection in the Navigation pane.
c. Click Insert Fields in the information pane for the Descriptions collection.
The Insert Fields window appears.
d. Add notes fields from the dfConglomerate Gifts data connection.
1) Click the selection tool next to the Connection field.
2) Click dfConglomerate Gifts.
3) Click OK.
4) Add fields from the Customers table.
a) Expand the Customers table.
b) Click the NOTES field. (Click the check box.)
c) Click Add.
d) Clear the selections for the Customers table.
5) Add fields from the Employees table.
a) Expand the Employees table.
b) Click the NOTES field. (Click the check box.)
c) Click Add.
d) Clear the selections for the Employees table.
6) Add fields from the Orders table.
a) Expand the Orders table.
b) Click the NOTES field. (Click the check box.)
c) Click Add.
d) Clear the selections for the Orders table.
e. Add notes fields from the dfConglomerate Grocery data connection.
1) Click the selection tool next to the Connection field.
2) Click dfConglomerate Grocery.
3) Click OK.
4) Add fields from the MANUFACTURERS table.
a) Expand the MANUFACTURERS table.
b) Click the NOTES field. (Click the check box.)
c) Click Add.
5) Click Close to close the Insert Fields window.
The Descriptions collection should now resemble the following:
11. Select File  Close Collection.
The two collections are displayed for the Basics Demos repository.
Exercises
1. Creating a Dates Collection
Create a collection named Dates and add the date fields from all the tables in the dfConglomerate
Gifts and dfConglomerate Grocery data connections. Add the collection to the Basics Exercises
repository.
QUESTION: How many date fields were added to the collection?
Answer:
2. Creating a Phone Numbers Collection
Create a collection named Phone Numbers and add the various phone number fields from all the
tables in the dfConglomerate Gifts and dfConglomerate Grocery data connections. Add the collection
to the Basics Exercises repository.
QUESTION: How many phone number fields were added to the collection?
Answer:
3.2 Designing Data Explorations

Objectives
• Define and investigate data explorations.
What Is a Data Exploration?

Data Exploration
• A data exploration reads data from databases and categorizes the fields in the selected tables.
• Data explorations enable you and your organization to identify data redundancies, and extract and organize metadata from multiple sources.
• Relationships between metadata can be identified and cataloged into types of data by specified business data types and processes.
A data exploration reads data from databases and categorizes the fields in the selected tables into
categories. These categories are predefined in the Quality Knowledge Base (QKB). Data explorations
perform this categorization by matching column names. You also have the option of sampling the data in
the table to determine whether the data is one of the specific types of categories in the QKB.
For example, your customer metadata might be grouped into one catalog and your address metadata might
be grouped in another catalog. After you organize your metadata into manageable chunks, you can
identify relationships between the metadata by table-level profiling. Creating a data exploration enables
you to analyze tables in databases to locate potential matches and plan for the profiles that you need to
run.
After you identify possible matches, you can plan the best way to handle your data and create a profile job
for any database, table, or field. Thus, you can use a data exploration of your metadata to decide on the
most efficient and profitable way to profile your physical data.
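Field name matching, one of the analysis methods configured in the next demonstration, can be sketched with a simple string-similarity stand-in. The code below uses Python's difflib in place of the QKB match definitions; the normalization rule and the 0.6 threshold are invented for illustration and only loosely mirror the sensitivity setting:

```python
# Hedged sketch: group columns whose names are similar above a threshold,
# using difflib as a stand-in for the QKB field-name match definitions.
from difflib import SequenceMatcher

def _norm(s):
    # Treat SHIP_ADDRESS and "SHIP ADDRESS" alike by normalizing separators.
    return s.upper().replace("_", " ")

def similarity(a, b):
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()

fields = ["ADDRESS", "SHIP ADDRESS", "STREET_ADDR", "CONTACT_ADDRESS", "PHONE"]
matches = [f for f in fields if similarity("ADDRESS", f) >= 0.6]
print(matches)  # ['ADDRESS', 'SHIP ADDRESS', 'CONTACT_ADDRESS']
```

The real match definitions are smarter than raw character similarity (for example, they can relate STREET_ADDR to ADDRESS through known vocabulary), which is why the exploration in the demonstration finds more related fields than this sketch would.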
Data Exploration Map
The Data Exploration map is used to quickly review the relationships between all of the databases, tables,
and fields that are included in a Data Exploration project.
Data Exploration Report – Field Match
The Field Match report displays a list of the database fields that match a selected field.
Data Exploration Report – ID Analysis
The Identification Analysis report lists the database fields that match categories in the definitions that you
selected for field name and sample data analysis.
Data Exploration Report – Table Match
The Table Match report displays a list of database tables that match a selected table or field.
Creating and Exploring a Data Exploration
This demonstration illustrates creating a data exploration, and then examining the resultant report.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. If necessary, expand the Basics Demos repository.
5. Right-click the profiles_and_explorations folder and select New  Data Exploration.
a. Type Ch3D2_dfConglomerate_Data_Exploration in the Name field.
b. Click OK.
The new data exploration appears on a tab.
6. Verify that the Properties tab is selected.
7. Define the data exploration specifics.
a. In the Data Sources pane, click Add Table.
The Add Tables window appears.
b. Click dfConglomerate Gifts.
c. Click dfConglomerate Grocery.
d. Click Add.
e. Click Close to close the Add Tables window.
f. Verify that seven tables appear in the Data sources area.
g. In the Analysis Methods pane, click Field name matching.
1) Click once in the Locale field. This reveals a selection tool.
2) Click the selection tool in the Locale field and select English (United States).
3) Click once in the Match Definition field. This reveals a selection tool.
4) Click the selection tool in the Match Definition field and select Field Name.
5) Click once in the Sensitivity field. This reveals a selection tool.
6) Click the selection tool in the Sensitivity field and select 75.
h. In the Analysis Methods pane, click Field name analysis.
1) Click once in the Locale field. This reveals a selection tool.
2) Click the selection tool in the Locale field and select English (United States).
3) Click once in the Identification Definition field. This reveals a selection tool.
4) Click the selection tool in the Identification Definition field and select Field Name.
i. In the Analysis Methods pane, click Sample data analysis.
1) Specify 500 as the value for the Sample size (records) field.
2) Click once in the Locale field. This reveals a selection tool.
3) Click the selection tool in the Locale field and select English (United States).
4) Click once in the Identification Definition field. This reveals a selection tool.
5) Click the selection tool in the Identification Definition field and select Contact Info.
8. Select Actions  Run (or click the Run tool) to execute the data exploration.
The status of the processing appears:
9. Explore the Data Exploration map results.
a. Click the Map tab.
The Field Relationship map appears:
The outer ring represents the selected data connections.
The inner ring represents a table from a data connection. If you move your cursor over a
segment of the inner ring, the table name appears.
The dots represent the fields.
b. Locate the MANUFACTURERS table from the dfConglomerate Grocery data connection.
c. Locate the dot representing the STREET_ADDR field.
d. Click the dot representing the STREET_ADDR field.
A sample of data values appears in the panel to the right.
Four other fields are related to this STREET_ADDR field.
QUESTION: What are the names of the four other fields?
QUESTION: What tables do the four fields appear in?
Answer:
ADDRESS from the Customers table in the dfConglomerate Gifts data source
ADDRESS from the Employees table in the dfConglomerate Gifts data source
SHIP_ADDRESS from the Orders table in the dfConglomerate Gifts data source
CONTACT_ADDRESS from the MANUFACTURERS table in the dfConglomerate Grocery data source
10. Explore the Data Exploration report information.
a. Click the Report tab.
b. Click in front of All Databases.
c. Double-click the dfConglomerate Grocery data connection.
d. Click the MANUFACTURERS table.
e. Double-click the MANUFACTURERS table.
f. Double-click the STREET_ADDR field.
The matching fields’ information appears.
g. Click the Identification Analysis riser bar.
1) Click in front of All Identification Definitions.
2) Click in front of Field Name.
3) Click the category of ADDRESS.
The fields identified as ADDRESS by the metadata identification definition are displayed.
a) Select all fields.
(1) Click the first field.
(2) Hold down the SHIFT key.
(3) Click the last field.
4) Right-click one of the selected fields and select Add To Collection  Address Info.
5) Click Collections in the resource pane.
6) If necessary, click in front of Basics Demos.
7) Click the Address Info collection.
8) Click (Open tool).
9) Verify that the “address” fields are part of this collection.
10) Select File  Close Collection.
h. Click the Table Match riser bar.
1) Click in front of All Databases.
2) Click in front of dfConglomerate Grocery.
3) Click the MANUFACTURERS table.
4) Click the Customers table.
The possible relationships between MANUFACTURERS and Customers are displayed.
11. Select File  Close Exploration.
The tab for the data exploration is closed.
3.3 Creating Data Profiles
Objectives
 Define and investigate data profiles.
What Is a Data Profile?
Data Profile
Provides the ability to inspect data for
errors, inconsistencies, redundancies,
and incomplete information.
When data are profiled, the following tasks can be
performed:
 Improve understanding of existing databases.
 Identify issues early in the data management process,
when they are easier and less expensive to manage.
 Determine which steps need to be taken to address
data problems.
Profile Properties
On a profile's Properties tab, you identify the tables within data connections to profile and the columns from each selected table to profile. From the same tab you can also configure metric overrides, redundant data analysis, PK/FK analysis, and alerts.
Data Profiling Metrics
Selecting Tools  Default Profile Metrics (from the Properties tab) enables the selection of metrics to be calculated for all fields. Selecting a particular field (from the Properties tab) and then selecting Actions  Override Metrics enables the default metrics to be overridden for that field.
 The first four metric selections take additional processing time to calculate, so be judicious in their selection.
Data Profile Report – Standard Metrics
Standard metrics include information such as the following:
• Record count
• Null count
• Blank count
• Non-null count
• Calculations such as mean, median, mode, standard deviation, standard error
• Min/max value
• Pattern count
• Primary key candidate
• Nullable/percent null
• Decimal places
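Most of these standard metrics are straightforward counts and descriptive statistics. As a rough illustration of what the profile engine computes, here is a minimal Python sketch over a small hypothetical field (the values are invented, not taken from the course data):

```python
from statistics import mean, median

# Hypothetical field values; None represents a database NULL.
values = ["27513", "27511", None, "", "27513", "27604"]

record_count = len(values)                      # Record count
null_count = sum(v is None for v in values)     # Null count
blank_count = sum(v == "" for v in values)      # Blank count
non_null = [v for v in values if v not in (None, "")]
unique_count = len(set(non_null))               # Unique count

numeric = [int(v) for v in non_null]            # numeric view for calculations
print(record_count, null_count, blank_count, unique_count)
print(min(numeric), max(numeric))               # Min/max value
print(mean(numeric), median(numeric))           # basic statistics
```

The real product computes more (pattern counts, primary key candidacy, and so on), but the counting logic follows the same shape.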
Data Profile Report – Visualizations
Visualizations are customized charts that you create based on your data and the metrics that you apply.
Data Profile Report – Notes
Notes can be added to your report to aid in your planning process.
Data Profile Report – Frequency Distributions
Distributions can be calculated for the frequency counts of the actual values of a field. To view the
record(s) where a particular distribution occurs, double-click on the value. The list of distribution values
can also be filtered and visualized.
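A frequency distribution is simply a count of each distinct value. A minimal sketch of the idea in Python, using hypothetical COUNTRY/REGION values rather than the course data:

```python
from collections import Counter

# Hypothetical COUNTRY/REGION values (invented for illustration)
values = ["USA", "USA", "Usa", "CAN", "USA", "MEX"]

freq = Counter(values)

# Most common first, like the Frequency Distribution tab
for value, count in freq.most_common():
    print(value, count)
```

Note that "USA" and "Usa" count as different values here, which is exactly the kind of inconsistency a frequency distribution surfaces.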
Data Profile Report – Pattern Frequencies
Distributions can be calculated for the pattern frequency counts (the pattern of the value in the field). To
view the record(s) where a particular pattern distribution occurs, double-click on the pattern distribution
value. The list of pattern distribution values can also be filtered and visualized.
For a pattern, the following rules apply:
• “A” represents an uppercase letter
• “a” represents a lowercase letter
• “9” represents a digit.
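The pattern rules above can be sketched as a small Python function (an illustration of the rules as stated, not DataFlux's internal implementation):

```python
def pattern(value: str) -> str:
    """Map each character to the profile pattern alphabet:
    uppercase -> 'A', lowercase -> 'a', digit -> '9';
    everything else (punctuation, spaces) is kept as-is."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

print(pattern("NC"))             # AA
print(pattern("(919)555-1234"))  # (999)999-9999
```

Grouping field values by this derived pattern yields the pattern frequency distribution.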
Data Profile Report – Percentiles
Percentiles provide a numeric layout of your data at a percentile interval that you specify. The percentile
interval is specified when you set your metrics.
Data Profile Report – Outliers
The Outliers tab lists the top X minimum and maximum value outliers. The number of listed minimum and maximum values is specified when you set your metrics.
Creating and Exploring a Data Profile
This demonstration illustrates the basics of setting the properties for a data profile, running the data
profile, and then viewing various aspects of the data profile report.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. If necessary, click to expand the Basics Demos repository.
4. Right-click the profiles_and_explorations folder and select New  Profile.
a. Type Ch3D3_dfConglomerate_Profile in the Name field.
b. Click OK.
The new profile appears on a tab.
5. Define the profile properties.
a. Verify that the Properties tab is selected.
b. Click the dfConglomerate Gifts check box. An X appears in the check box.
c. Click the dfConglomerate Grocery check box. An X appears in the check box.
QUESTION: How many tables total are selected?
Answer:
Seven (7) total
QUESTION: How do you select only some of the fields for a particular table?
Answer:
Click to expand the data connection. Then click each of the desired tables.
6. Review profile options.
a. Select Tools  Profile Options.
b. Review the options on the General tab.
c. Click the Charts tab.
d. Review the options on the Charts tab.
e. Click the Quality Knowledge Base tab.
f. Review the options on the Quality Knowledge Base tab.
g. Click Cancel to close the Options window.
7. Define profile options.
a. Select Tools  Default Profile Metrics.
b. Verify that all metrics are selected.
 The first four metric selections take additional processing time to calculate, so be judicious in their selection.
c. Click OK to close the Metrics window.
8. Select File  Save Profile to save the profile to this point.
9. Override metrics for selected columns.
a. Click in front of dfConglomerate Grocery.
b. Click the MANUFACTURERS table.
c. Click the ID field.
d. Hold down the CTRL key and click the NOTES field.
e. Right-click one of the selected columns and select Override Metrics.
f. Clear the selection for Frequency distribution.
g. Clear the selection for Pattern frequency distribution.
h. Clear the selection for Percentiles.
i. Clear the selection for Outliers.
j. Click OK to close the Metrics window.
The check marks under the M column denote that metrics were overridden for these fields.
10. Select File  Save Profile to save the profile.
11. Select Actions  Run Profile Report.
a. Type First profile in the Description field.
b. Click OK to close the Run Profile window.
The profile is executed. The status of the execution is displayed.
The Report tab becomes active.
12. Review the Profile report.
a. Click in front of dfConglomerate Gifts.
b. Click the Customers table.
 Some metrics display (not applicable), which indicates that the metric calculation is not applicable to that field type.
c. Click in front of the Customers table.
d. Click the COUNTRY/REGION field.
There are 63 values.
There are 4 unique values.
There are 4 patterns to the values.
e. Click the Frequency Distribution tab.
f. Click the Pattern Frequency Distribution tab.
Almost 83% of the data has the pattern of three capital letters. It would be worthwhile to
investigate fixing the patterns of the nine observations that do not have the AAA pattern.
g. Select Insert  New Note.
1) Type Check the patterns of this field. in the Add Note window.
2) Click OK to close the Add Note window.
3) Verify that the Notes tab is now active and the new note appears.
This is a field-level note.
h. Click the Outliers tab.
This tab displays the top X minimum and maximum values.
i. Add a table-level note.
1) Click the Customers table.
2) Select Insert  New Note.
3) Type Table level note with reminders on what to check. in the Add Note window.
4) Click OK to close the Add Note window.
5) With the Customers table selected, click the Notes tab.
13. Investigate visualizations.
a. With the Customers table selected, click the Visualizations tab.
b. Click to the right of the Chart field.
c. Type Visual comparison of metrics across fields for Customer table in the Description field.
d. Verify that the Chart type is set to Bar.
e. Verify that the Data type is set to Field metrics.
f. Click Select/unselect all under the Fields pane. (This selects all fields.)
g. Click Count, Unique Count, Null Count, Blank Count, and Pattern Count as the choices in
the Metrics pane.
The final settings should resemble the following:
h. Click OK to close the Chart Properties window.
The Chart window appears.
The chart can be saved as a .jpg, .png, .gif, or .bmp file.
The chart can be printed (if a valid printer connection is established).
i. Right-click the background of the chart and select Chart Type.
If the bar chart does not convey the desired visualization clearly, it can easily be changed to one of the other available chart types.
14. Investigate a different table that was also profiled in this job.
a. Click in front of the dfConglomerate Grocery data connection.
b. Click in front of the MANUFACTURERS table object.
c. Click the NOTES field.
The Column Profiling metrics for the NOTES field are displayed.
d. Click the Frequency Distribution tab.
Recall that this option was cleared in a metric override for this field.
Similarly, the tabs for Pattern Frequency Distribution, Percentiles, and Outliers display
(Not calculated) because these options were also cleared in the metric override for this field.
15. Select File  Close Profile.
Exercises
3. Profiling the Tables in dfConglomerate Grocery
Profile the two tables (MANUFACTURERS and BREAKFAST_ITEMS) in the dfConglomerate
Grocery data connection. Some specifics follow:
• Use the Basics Exercises repository to store the profile. (Use the profile_and_explorations folder.)
Provide a name of Ch3E3_dfConglomerate_Grocery_Profile.
• Calculate all metrics on all fields (initial setup).
• For the ID fields in both the BREAKFAST_ITEMS and MANUFACTURERS tables, do not
calculate frequency distribution, a pattern frequency distribution, percentiles, or outliers.
• Run the profile, and provide a description of Initial Profile.
• Review the report.
QUESTION: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS
table?
Answer:
QUESTION: What are the distinct values for the UOM field in the BREAKFAST_ITEMS table?
Answer:
QUESTION: What is the ID field value for the records with PK as a value for the UOM field in the
BREAKFAST_ITEMS table?
Answer:
• Filter the list of distinct values to show only those with a frequency count more than 5.
Hint: Select View  Report Filter.
• Add a note to the UOM field. Type Standardize the UOM field. After it is added, verify that the
note appears in the UOM field.
• View the pattern frequency distribution for the PHONE field in the MANUFACTURERS table.
View the records that match the pattern (999)999-9999. Add a note saying to standardize this
PHONE field.
• Create a line graph to visually explore the comparison between the minimum length and maximum
length metrics. Compare these metrics for all fields in the MANUFACTURERS table.
4. Profiling a Selection of Tables in a DataFlux Sample
Profile all fields from four tables (Client_Info, Contacts, Product, and Product_Sales) in the
DataFlux Sample data connection. Some specifics follow:
• Use the Basics Exercises repository to store the profile. (Use the profile_and_explorations folder.)
Provide a name of Ch3E4_DataFluxSample_Profile.
• Calculate all column profiling metrics on all fields. (Do not calculate frequency distributions,
pattern frequency distribution, percentiles, and outliers on all fields.)
• Override the default metrics for the Phone and State fields of the Client_Info table. In addition,
calculate frequency distribution and pattern frequency distribution.
• Override the default metrics for the PHONE and STATE fields of the Contacts table. In addition,
calculate frequency distribution and pattern frequency distribution.
• Override the default metrics for the WEIGHT_OZS field of the Product table. In addition,
calculate frequency distribution, pattern frequency distribution, percentiles (with default interval),
and outliers (with default minimum and maximum specifications).
• Run the profile and answer the following questions:
QUESTION: How many distinct values exist for the STATE field in the Contacts table?
Answer:
QUESTION: How many distinct patterns exist for the STATE field in the Contacts table?
Answer:
QUESTION: What does a bar chart graphic (visualization) tell you about the comparison of the data
length and maximum length metrics for all fields in the Contacts table? (Do not
include the DATE, DELETE_FLG, ID, and MATCH_CD fields in the graphic
view.)
Answer:
QUESTION: What is the ID field value for the record(s) with a “bad” PHONE field pattern in the
Contacts table? Add a field level note saying to fix this value.
Answer:
• Add a table level note saying to check for patterns for the STATE and PHONE fields in the
Contacts table.
Data Profiling – Other Input Types
Data profiling can be performed on data connections as
well as the following:
 SQL Queries
 Text Files
 Filtered Data
 Collections
A sample interval can be specified for any input type. By default the sample interval is 1. However,
another interval can be specified by selecting Actions  Change Sample Interval.
Profiling Data Using Text File Input
This demonstration illustrates the steps necessary to use a text file as an input data source for a data
profile.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. Click the Basics Demos repository.
4. Click New  Profile.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of
the Save in field.)
b. Type Ch3D4_TextFile_Profile in the Name field.
c. Click OK.
The new profile appears on a tab.
5. On the Properties tab, click Text Files.
This action enables the Insert Text File menu choice.
6. Select Insert  New Text File.
a. Type Gift Tradeshow Information in the Table name field.
b. Click OK to close the Insert Text File window.
7. Specify delimited file information.
a. Click
next to the Input file name field.
The Open window appears.
b. Navigate to S:\Workshop\dqdmp1\data\Text Files.
c. Click the file Gift_Tradeshow_List.txt.
d. Click Open.
The pathname and filename are returned to the Input file name field.
e. Click in the Text qualifier field and select " (double quotation mark).
f. Verify that the default delimiter is set to Comma.
g. Click the Number of rows to skip check box and verify that the default value is set to one.
h. Click Suggest under the Fields area.
1) Click the First row contains field names check box.
2) Click the Guess field types and lengths from the file content check box.
3) Click OK.
The Fields area is populated with information about the fields in the text file.
The final settings for the Delimited File Information window should resemble the following:
i. Click OK to close the Delimited File Information window.
The field information is returned to the profile. The fields here can be “processed” just as you did for
a table source.
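For comparison, the same delimited-file settings (comma delimiter, double-quotation-mark text qualifier, first row holding field names) map naturally onto Python's csv module. The sample content below is invented for illustration, not the actual Gift_Tradeshow_List.txt:

```python
import csv
import io

# Hypothetical sample standing in for the real tradeshow file
sample = '"COMPANY","STATE"\n"Gifts R Us","NC"\n"Tar Heel Gifts","N.C."\n'

# delimiter and quotechar mirror the Delimited File Information settings
reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"')
rows = list(reader)

fields = rows[0]      # first row contains field names
records = rows[1:]    # remaining rows are data

print(fields)         # ['COMPANY', 'STATE']
print(len(records))   # 2
```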
8. If necessary, select all fields by clicking the check box in front of Gift Tradeshow Information.
9. Select all metrics.
a. Select Tools  Default Profile Metrics.
b. Verify that all metrics are selected.
c. Click OK to close the Metrics window.
10. Select File  Save Profile to save the profile.
11. Select Actions  Run Profile Report.
a. Type Profiling a text file in the Description field.
b. Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
12. Review the profile report.
a. Click in front of Gift Tradeshow Information to view the available fields.
b. Click the STATE field.
Notice the Pattern Count metric (9).
c. Click the Pattern Frequency Distribution tab.
d. Double-click the Aa. pattern.
This drill-through action goes back to the text file to get the records.
e. Click Close to close the drill-through window.
13. Select File  Close Profile.
Profiling Data Using Filtered Data and an SQL Query
This demonstration illustrates the steps necessary to access and filter data as input for a data profile. In addition, an SQL query is used as input for a data profile.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. Click the Basics Demos repository.
4. Click New  Profile.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of
the Save in field.)
b. Type Ch3D5_FilterAndSQLQuery_Profile in the Name field.
c. Click OK.
The new profile appears on a tab.
5. Set up the SQL query.
a. On the Properties tab, click in front of the dfConglomerate Gifts data connection.
b. Click the Products table object.
c. Select Insert  New SQL Query.
d. Type Miscellaneous Products _SQL_Query_ in the SQL Query name field.
e. Click OK to close the Insert SQL Query window.
f. Type a valid SQL query.
1) Type the following as the value for the SQL Query field:
Select Products.* from Products
where Products.CATEGORY = 'Miscellaneous'
2) Click Test.
g. Click OK to close the information window.
h. Click OK to close the SQL Query window.
The field information is returned to the profile. The fields here can be “processed” as you did for
a table source.
6. Select all fields for the SQL query results by clicking the check box in front of Miscellaneous Products _SQL_Query_.
7. Set up the filtered data.
a. On the Properties tab, if necessary, click in front of the dfConglomerate Gifts data connection.
b. Click the Products table object.
c. Select Insert  New Filtered Table.
d. Type Miscellaneous Products _Filter_ in the Filter name field.
e. Click OK to close the Insert Filter window.
The Filter on Table window appears.
f. Specify the filter.
1) Click in the Field name field and select CATEGORY.
2) Click in the Operation field and select equal to.
3) Type Miscellaneous in the Single value field.
4) Click Add Condition.
5) Click OK to close the Filter on Table window.
The field information is returned to the profile. The fields here can be “processed” as you did
for a table source.
8. Select all fields for the filter information by clicking the check box in front of Miscellaneous Products _Filter_.
9. Select all metrics.
a. Select Tools  Default Profile Metrics.
b. Verify that all metrics are selected.
c. Click OK to close the Metrics window.
10. Select File  Save Profile to save the profile.
11. Select Actions  Run Profile Report.
a. Type Profiling filtered data and query results in the Description field.
b. Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
12. Review the profile report.
a. Click in front of Miscellaneous Products _SQL_Query_ to view the available fields.
b. Click in front of Miscellaneous Products _Filter_ to view the available fields.
c. Click the CATEGORY field for Miscellaneous Products _SQL_Query_.
Notice the Unique Count metric.
d. Click the Frequency Distribution tab.
e. Click the CATEGORY field for the Miscellaneous Products _Filter_.
f. Click the Column Profiling tab.
Notice the Unique Count metric.
g. Click the Frequency Distribution tab.
13. Continue to investigate the profile report (if desired).
IMPORTANT:
• The data generated for both the SQL query and the filter have the same results.
• The filter pulled all records.
• The filter was processed on the machine that the profile was run on.
• The database does the filtering for an SQL query.
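The practical distinction is where the WHERE clause is evaluated. A minimal sqlite3 sketch (using a hypothetical Products table, not the dfConglomerate data) shows that the two approaches return the same rows while placing the filtering work differently:

```python
import sqlite3

# Hypothetical Products table (invented rows for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (NAME TEXT, CATEGORY TEXT)")
conn.executemany("INSERT INTO Products VALUES (?, ?)",
                 [("Mug", "Miscellaneous"),
                  ("Card", "Paper"),
                  ("Magnet", "Miscellaneous")])

# SQL query input: the database evaluates the WHERE clause,
# so only the matching rows are returned to the client.
pushed = conn.execute(
    "SELECT * FROM Products WHERE CATEGORY = 'Miscellaneous'").fetchall()

# Filter input: every row is pulled to the client first,
# then filtered on the machine running the profile.
all_rows = conn.execute("SELECT * FROM Products").fetchall()
filtered = [row for row in all_rows if row[1] == "Miscellaneous"]

print(pushed == filtered)  # True
```

For large tables, letting the database do the filtering usually moves far less data.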
14. Select File  Close Profile.
Profiling Data Using a Collection
This demonstration illustrates the steps necessary to access data from a collection as input for a data
profile.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Data riser bar.
3. Click in front of Collections to view the available collections.
4. Click in front of Basics Demos to view the available collections.
5. Click the Descriptions collection.
6. Click (Profile tool) from the toolbar at the top of the Data navigation pane.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value
of the Save in field.)
b. Type Ch3D6_Collection_Profile_DESCRIPTIONS in the Name field.
c. Click OK.
The new profile appears on a tab.
7. Verify that the “description” fields are selected from various data connections.
a. Verify that four tables are selected.
b. Click the Customers table and verify that the NOTES field is selected.
c. Click the Employees table and verify that the NOTES field is selected.
d. Click the Orders table and verify that the NOTES field is selected.
e. Click the MANUFACTURERS table and verify that the NOTES field is selected.
8. Select all metrics.
a. Select Tools  Default Profile Metrics.
b. Verify that all metrics are selected.
c. Click OK to close the Metrics window.
9. Select File  Save Profile to save the profile.
10. Select Actions  Run Profile Report.
a. Type Profiling a Collection of NOTES fields in the Description field.
b. Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
11. Review the profile report.
a. Click in front of dfConglomerate Gifts to view the available tables.
b. Click in front of dfConglomerate Grocery to view the available table.
c. Click the Customers table and verify that the NOTES field was the only analyzed field.
d. Click the Employees table and verify that the NOTES field was the only analyzed field.
e. Click the Orders table and verify that the NOTES field was the only analyzed field.
f. Click the MANUFACTURERS table and verify that the NOTES field was the only analyzed field.
12. Continue to investigate the profile report (if desired).
13. Select File  Close Profile.
Exercises
5. Profiling a Text File
Find more information about the data contained in the Manufacturer_Contact_List.txt file.
Some specifics follow:
• Use the Basics Exercises repository to store the profile. (Add to the profile_and_explorations
folder.) Provide a name of Ch3E5_Manufacturer_TextFile_Profile.
• File attributes:
Table name: Manufacturer_Contacts
File Type: Delimited
Filename and location: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.txt
Text qualifier: Double quotation mark
Field delimiter: Comma
Number of rows to skip: 1
Fields: Use the Suggest feature. (The first row contains field names. Guess field types and lengths from file content.)
• Use the Export feature (to export the file layout) for this text file. Name the exported file S:\Workshop\dqdmp1\data\Output Files\Manufacturers_Contact_List.dfl.
• Select all fields for profiling.
• Select all profiling metrics.
• Save and run the profile.
• Answer the following questions:
QUESTION: How many patterns exist for the WORK_STATE field?
Answer:
QUESTION: How many values exist for the WORK_STATE field?
Answer:
QUESTION: What values exist for the WORK_STATE field?
Answer:
QUESTION: What do the records from the WORK_STATE of CA look like?
Answer:
QUESTION: What can be said about the comparison of the Pattern Count and Unique Count
metrics for all fields? (Hint: Create a bar chart to display the comparison.)
Answer:
Data Profile Report – Redundant Data Analysis
Redundant Data analysis enables you to quickly review whether there is redundant data for selected fields
and tables.
 Finding redundant or duplicate records of data is discussed in a future chapter.
Data Profile Report – PK/FK Analysis
The Primary Key/Foreign Key analysis counts the values that two different tables have in common to determine whether it is possible to set a primary key/foreign key relationship and thus create a parent and child table. This analysis is useful in maintaining the referential integrity between tables.
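The underlying check can be sketched with plain Python sets; the key values below are hypothetical, not taken from the Customers or Orders tables:

```python
# Hypothetical key values from a candidate parent and child table
customers_id = {1, 2, 3, 4, 5}             # Customers.ID (candidate primary key)
orders_customer_id = [1, 2, 2, 3, 5, 5]    # Orders.CUSTOMER_ID (candidate foreign key)

child_values = set(orders_customer_id)
orphans = child_values - customers_id      # FK values with no matching PK

# 100% means every child value exists in the parent table
match_pct = 100 * (len(child_values) - len(orphans)) / len(child_values)

print(match_pct)   # 100.0 -> referential integrity would hold
print(sorted(orphans))
```

A match percentage below 100% flags child rows that would violate the relationship.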
Data Profiling – Additional Analysis (Self-Study)
This demonstration investigates how to set up and review a redundant data analysis as well as a primary key/foreign key analysis.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. Click in front of Basics Demos to view the available folders.
4. Click in front of profiles_and_explorations to view the available items.
5. Right-click the Ch3D3_dfConglomerate_Profile profile and select Open.
6. Select File  Save Profile As to save the profile.
a. Type Ch3D7_dfConglomerate_Profile_RDA_PKFK as the value for the Name field.
b. Verify that the profiles_and_explorations folder is selected.
c. Click Save.
7. Click the Properties tab.
8. Establish redundant data analysis.
a. Select Actions  Analyze Redundant Data.
b. Click to the right of the Configurations area.
c. Type Ch3D7_Compare_Overlap_of_PhoneNumber_fields in the Name field.
d. Click Match code.
e. Click in the Match definition field and select Phone.
f. Click in the Sensitivity field and select 95.
g. Click in front of dfConglomerate Gifts to view the available tables.
h. Click in front of the Customers table to view the available fields.
i. Double-click BUSINESS PHONE to move this field to the Selected list.
j. Double-click HOME PHONE to move this field to the Selected list.
k. Double-click MOBILE PHONE to move this field to the Selected list.
l. Click in front of the Employees table to view the available fields.
m. Double-click BUSINESS PHONE to move this field to the Selected list.
n. Double-click HOME PHONE to move this field to the Selected list.
o. Double-click MOBILE PHONE to move this field to the Selected list.
The final settings should resemble the following:
p. Click OK to close the Redundant Data Analysis Configuration Properties window.
This instance of redundant data analysis appears in the configurations listing:
q. Click Close to close the Redundant Data Analysis Configurations window.
9. Verify that the redundant data analysis is specified at the column level.
a. If necessary, click the Properties tab.
b. Click in front of dfConglomerate Gifts to view the available tables.
c. Click the Customers table to view the available fields.
d. Verify that a check mark appears under the R column for the three phone fields.
e. Click the Employees table to view the available fields.
f. Verify that a check mark appears under the R column for the three phone fields.
10. Select File  Save Profile to save the profile.
11. Establish Primary Key/Foreign Key analysis.
a. If necessary, click the Properties tab.
b. If necessary, click in front of dfConglomerate Gifts to view the available tables.
c. If necessary, click the Customers table to view the available fields.
d. Click the ID field.
e. Select Actions  Analyze Primary Key/Foreign Key Relationships.
f. Click in front of dfConglomerate Gifts to view the available tables.
g. Click in front of the Orders table to view the available fields.
h. Click CUSTOMER_ID.
i. Click Add.
j. Click OK to close the Primary Key/Foreign Key Analysis window.
12. Verify that the Primary Key/Foreign Key Analysis is specified at the column level.
a. If necessary, click the Properties tab.
b. Click in front of dfConglomerate Gifts to view the available tables.
c. Click the Customers table to view the available fields.
d. Verify that a check mark appears under the P column for the ID field.
13. Select File  Save Profile to save the profile.
14. Select Actions  Run Profile Report.
a. Type Performing Redundant Data and PK/FK analyses in the Description field.
b. Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
15. Review the profile report for the Primary Key/Foreign Key analysis.
a. Click in front of dfConglomerate Gifts to view the available tables.
b. Click in front of the Customers table to view the list of columns.
c. Click the ID column.
d. Click the Primary Key/Foreign Key Analysis tab.
e. Verify that the match percentage is 100%.
16. Review the Redundant Data analysis.
a. Click the Redundant Data Analyses riser bar in the Resource pane.
b. Click the Ch3D7_Compare_Overlap_of_PhoneNumber_fields item.
The main area is updated with the analysis results.
The view contains two portions: Grid Data Table and the Data Venn Diagram.
The Grid Data Table displays the relationships between the fields that were selected for analysis.
You can click on a cell to see the relationship displayed in the Venn diagram. You can also click
the Grid data radio buttons (Left outliers, Common, Right outliers) to control which of the
relationships are displayed in the table.
The colored dots in the table cells indicate the percentage of matching rows between the two
selected rows:
• Green reflects a low level of redundancy (under 20% by default).
• Yellow reflects a medium level of redundancy (between 20% and 50% by default).
• Red reflects a high level of redundancy (more than 50% by default).
These levels can be changed in the Options dialog box on the Report tab.
Explanation of the column grid: the first group of fields is from the Customers table, and the second group of fields is from the Employees table.
c. Click in the cell where the BUSINESS PHONE from Customers intersects with the BUSINESS PHONE from Employees.
d. Double-click the overlapping area.
This value for the BUSINESS PHONE was found four times in the Customers table, and once
in the Employees table.
 The information can be exported to a text file or printed.
e. Click Close.
17. Select File  Close Profile.
3.4 Designing Data Standardization Schemes
Objectives
 Define and create standardization schemes.
Using an Analysis Report to Build a Scheme
The analysis of an individual field can be counted as a whole (phrase) or based on each of the field's
elements. For example, the field value DataFlux Corporation is treated as two permutations if the
analysis is set to Element, but as only one permutation if the analysis is set to Phrase.
The above analysis is a phrase analysis of the Company Name field. Similar company names are
grouped together based on the match definition and sensitivity selected for the analysis.
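The Phrase versus Element distinction can be sketched as follows (illustrative Python, not DataFlux code; `count_permutations` is an invented helper that uses simple space tokenization):

```python
# Illustrative sketch: counting permutations of a field value under
# Phrase analysis versus Element analysis.
from collections import Counter

def count_permutations(values, analysis="Phrase"):
    """Phrase analysis counts each whole value once; Element analysis
    counts every space-separated word in the value."""
    counts = Counter()
    for value in values:
        if analysis == "Element":
            counts.update(value.split())
        else:
            counts.update([value])
    return counts

names = ["DataFlux Corporation"]
print(len(count_permutations(names, "Phrase")))   # 1 permutation
print(len(count_permutations(names, "Element")))  # 2 permutations
```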
What Is a Standardization Scheme?
A standardization scheme takes various spellings or representations of a data value and lists a standard
way to consistently write this value.
A standardization scheme can be built based on the analysis report. When a scheme is applied, if the input
data is equal to the value in the Data column, then the data is changed to the value in the Standard
column. The standard of DataFlux was selected by the Scheme Build function because it was the
permutation with the most occurrences in the analysis report.
The following special values can be used in the Standard column:
//Remove
The matched word or phrase is removed from the input string.
%
The matched word or phrase is not updated. (This is used to show that a word or phrase is
explicitly marked for no change.)
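Conceptually, applying a scheme behaves like a lookup table. Below is a minimal sketch (invented Python, assuming exact matching on the Data column; `apply_scheme` and the sample entries are for illustration only, not DataFlux internals):

```python
# Hedged sketch of how a phrase scheme behaves when applied.
# Keys are Data column values; values are the Standard column.
scheme = {
    "DataFlux Corporation": "DataFlux",
    "Dataflux Corp": "DataFlux",
    "Acme": "%",        # explicitly marked for no change
    "?": "//Remove",    # matched value is removed from the input
}

def apply_scheme(value, scheme):
    standard = scheme.get(value)
    if standard is None or standard == "%":
        return value          # no entry, or explicitly left unchanged
    if standard == "//Remove":
        return ""             # matched word or phrase is removed
    return standard

print(apply_scheme("Dataflux Corp", scheme))  # DataFlux
print(apply_scheme("?", scheme))              # (empty string)
```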
Creating a Phrase Standardization Scheme
This demonstration illustrates creating a phrase standardization scheme.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. Click the plus sign in front of Basics Demos to view the available folders.
4. Click the plus sign in front of profiles_and_explorations to view the available items.
5. Right-click the Ch3D3_dfConglomerate_Profile profile and select Open.
6. Click the Report tab.
7. Click the plus sign in front of dfConglomerate Gifts to view the available folders.
8. Click the Customers table.
9. If necessary, click the Standard Metrics tab.
10. On the Standard Metrics tab, click the COMPANY field.
11. Select Actions  Build a Scheme.
The Report Generation window appears.
a. Under Phrase analysis, click the down arrow in the Match definition field and select
Organization (Scheme Build).
b. Click the down arrow in the Sensitivity field and select 75.
c. Click OK to close the Report Generation window.
The Scheme Builder window appears with the Report side populated with an alphabetical list of
the values found in the Frequency Distribution report. Similar values are grouped together;
the grouping is determined by the selected match definition and sensitivity.
d. Select Edit  Build Scheme.
1) Verify that English (United States) is selected.
2) Click OK to close the Select Locale(s) window.
By default, all values are “moved” to the Scheme side of the Scheme Builder window. The multi-item
groups are given a standard value, which is the most frequently occurring value within the
grouping on the report side.
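The default standard selection can be sketched like this (illustrative Python; the group structure, counts, and helper name are invented examples, not DataFlux code):

```python
# Sketch: within each group of similar values, the most frequently
# occurring permutation becomes the Standard for every group member.
def pick_standards(groups):
    """groups maps a group id to a list of (value, occurrence_count)."""
    scheme = {}
    for members in groups.values():
        standard = max(members, key=lambda m: m[1])[0]
        for value, _ in members:
            scheme[value] = standard
    return scheme

groups = {1: [("dfConglomerate Inc", 12), ("dfConglomerate Incorporated", 3)]}
print(pick_standards(groups))
```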
e. Change the standard value of dfConglomerate Incorporated.
1) Scroll on the Scheme side and locate the group of records with the standard value of
dfConglomerate Incorporated.
2) Right-click on one of the dfConglomerate Incorporated standard values and select Edit.
a) Type dfConglomerate Inc. in the Standard field.
b) Click OK.
Notice that the change applies to all items in the group.
Note: If a single value in a group of items needs to be changed, then select Edit 
Modify Standards Manually  Single Instance. A single value can then be
modified manually. To toggle back to the ability to change all instances in a
group, select Edit  Modify Standards Manually  All Instances.
f. Change the standard value of Eta Technologies.
1) Scroll on the Scheme side and locate the group of records with the standard value of Eta
Technologies.
2) Right-click on one of the Eta Technologies standard values and select Edit.
a) Type ETA Computers in the Standard field.
b) Click OK.
Notice that the change applies to all items in the group.
g. Change the standard value of Farmers Insurance Grp Inc.
1) Scroll on the Scheme side and locate the group of records with the standard value of Farmers
Insurance Grp Inc.
2) Right-click on one of the Farmers Insurance Grp Inc standard values and select Edit.
a) Type Farmers Insurance Group in the Standard field.
b) Click OK.
h. Change the standard value of Safeguard Business Sys.
1) Scroll on the Scheme side and locate the group of records with the standard value of
Safeguard Business Sys.
2) Right-click on one of the Safeguard Business Sys standard values and select Edit.
a) Type Safeguard Business Systems in the Standard field.
b) Click OK.
3.4 Designing Data Standardization Schemes
i. Change the standard value of ?.
1) Scroll on the Scheme side and locate the record with the standard value of ?.
2) Right-click on the ? standard value and select Edit.
a) Type //Remove in the Standard field.
Note: This removes the value from the field where this scheme is applied.
b) Click OK.
j. Make the scheme case insensitive. (Schemes are case sensitive by default.)
1) Click Options.
2) Clear the Case sensitive check box.
3) Click OK.
12. Save the scheme to the default QKB.
a. Select File  Save.
b. Type Ch3D8 Company Phrase Scheme in the Name field.
c. Click Save.
13. Select File  Exit to close the Scheme Builder window.
14. If necessary, click the Report tab.
15. Click the plus sign in front of dfConglomerate Grocery to view the available folders.
16. Click the MANUFACTURERS table.
17. If necessary, click the Standard Metrics tab.
18. On the Standard Metrics tab, click the MANUFACTURER field.
19. Select Actions  Build a Scheme.
The Report Generation window appears.
a. Under Phrase analysis, click the down arrow in the Match definition field and select
Organization (Scheme Build).
b. Click the down arrow in the Sensitivity field and select 75.
c. Click OK to close the Report Generation window.
20. Access the scheme Ch3D8 Company Phrase Scheme.
a. Select File  Open.
b. Click Ch3D8 Company Phrase Scheme.
c. Click Open.
21. Select Report  Compare Report to Scheme  Highlight Unaccounted Permutations.
Note: Values not already in the scheme are highlighted in red.
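The comparison behaves like a set difference. Below is a minimal, invented Python sketch (the sample values come from the demonstration, but the function is not DataFlux code):

```python
# Sketch of "Highlight Unaccounted Permutations": report values with no
# Data entry in the scheme are the ones that get flagged.
def unaccounted(report_values, scheme):
    return [v for v in report_values if v not in scheme]

scheme = {"Arrowhead Mills": "Arrowhead Mills, Inc"}
report = ["Arrowhead Mills", "Arrowhead Mills Inc.", "General Mills"]
print(unaccounted(report, scheme))  # ['Arrowhead Mills Inc.', 'General Mills']
```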
22. Update the existing scheme for a set of values.
a. At the bottom of the Report side, locate the Standard field.
b. Type Arrowhead Mills, Inc in the Standard field.
c. Locate the group of records that begin with Arrowhead Mills, Inc.
d. Click the first of these records.
e. Press and hold the SHIFT key and click the last of these records.
f. Click Add to Scheme (at the bottom of the Report side).
These five values are added to the scheme. Notice that the values are no longer highlighted in red.
23. Update the existing scheme for a set of values.
a. At the bottom of the Report side, locate the Standard field.
b. Type Breadshop Natural Foods in the Standard field.
c. Locate the group of records that begin with Breadshop Natural Foods.
d. Click the first of these records.
e. Press and hold the SHIFT key and click the last of these records.
f. Click Add to Scheme (at the bottom of the Report side).
24. Update the existing scheme for a set of values.
a. Locate the group of records that begin with General Mills.
b. Double-click the value General Mills.
This action populates the Standard field with the General Mills value.
c. Select the records in this group that are red.
d. Click Add to Scheme (at the bottom of the Report side).
A warning window appears and indicates that there is duplicate data.
e. Click OK to close the warning window (titled Duplicate Data).
25. Update the existing scheme for a set of values.
a. Locate the group of records that begin with Hannaford.
b. Double-click the value Hannaford Bros.
c. Select the records that begin with Hannaford.
d. Click Add to Scheme (at the bottom of the Report side).
26. Select File  Save.
27. Select File  Exit to close the Scheme Builder window.
Creating an Element Standardization Scheme
This demonstration illustrates creating an element standardization scheme.
1. If necessary, access DataFlux Data Management Studio and the Ch3D3_dfConglomerate_Profile profile.
a. Select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
b. Click the Folders riser bar.
c. Click the plus sign in front of Basics Demos to view the available folders.
d. Click the plus sign in front of profiles_and_explorations to view the available items.
e. Right-click the Ch3D3_dfConglomerate_Profile profile and select Open.
2. Click the Report tab.
3. Click the plus sign in front of dfConglomerate Gifts to view the available folders.
4. Click the Customers table.
5. If necessary, click the Standard Metrics tab.
6. On the Standard Metrics tab, click the ADDRESS field.
7. Select Actions  Build a Scheme.
The Report Generation window appears.
a. Click Element analysis.
b. Keep the Chop table field value set to None.
Note: This builds a report for each individual word in the selected field (separated by a space).
c. Click OK to close the Report Generation window.
8. Investigate the Drive values.
a. Select Report  Find in Report.
b. Type Dr in the Find what field.
c. Select the Match whole word only check box.
d. Click Find Next.
e. Click Cancel to close the Find window.
The Report side of the Scheme Builder window has the first occurrence of Dr highlighted.
f. Click the Dr value.
g. Select Report  Permutation Drill Through.
h. Click Close to close the Permutation Drill Through window.
i. Add Dr, Dr., Drive, and Drive, to the scheme with a standard of Drive.
1) Type Drive in the Standard field.
2) In the report, click the Dr value. Then hold down the CTRL key, and click the Dr., Drive, and
Drive, values.
3) Click Add To Scheme.
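An element scheme standardizes each word of a field independently, which can be sketched as follows (illustrative Python, assuming simple space tokenization and exact matching; not DataFlux internals):

```python
# Sketch: an element scheme is applied word by word, so "Dr." inside a
# longer address is still standardized to "Drive".
element_scheme = {"Dr": "Drive", "Dr.": "Drive", "Drive,": "Drive"}

def apply_element_scheme(value, scheme):
    words = [scheme.get(w, w) for w in value.split()]
    return " ".join(words)

print(apply_element_scheme("123 Main Dr.", element_scheme))  # 123 Main Drive
```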
9. Investigate and set a standard for the Avenue values.
a. Select Report  Find in Report.
b. Type Avenue in the Find what field.
c. Click the Match whole word only check box.
d. Click Up.
e. Click Find Next.
f. Click Cancel to close the Find window.
The Report side of the Scheme Builder window has the first occurrence of Avenue highlighted.
10. Add permutations of Avenue to the scheme with a standard of Avenue.
a. Double-click the value Avenue. The Standard field is updated.
b. In the report, click the Ave value. Then hold down the CTRL key, and click Ave. and Avenue
values.
c. Click Add To Scheme.
The values with the specified standard are added to the Scheme side of the Scheme Builder window.
11. Save the scheme.
a. Select File  Save to save the standardization scheme in the Quality Knowledge Base (QKB).
b. Type Ch3D9 Address Element Scheme in the Name field.
c. Click Save.
12. Select File  Exit to close the Scheme Builder window.
13. Select File  Close Profile.
Importing a Scheme from a Text File
This demonstration illustrates importing a scheme from a text file.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Select Tools  Other QKB Editors  Scheme Builder.
3. Click the English (United States) locale and click Open.
4. Select File  Import From Text File.
The Import From Text File window appears.
5. Specify settings for the Import From Text File window.
a. Click the browse button next to the Import file name field.
1) Navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click the Training_Demo_Company_Scheme.txt file.
3) Click Open.
b. Click the down arrow in the Text qualifier field and select " (double quotation mark).
c. Type 1 in the Number of rows to skip field.
d. Verify that Comma is selected as the value for the field delimiter.
e. Click OK.
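The import just performed reads a comma-delimited file with a double-quote text qualifier and one header row to skip. It can be sketched with Python's csv module (assuming the file holds two columns, Data and Standard; the helper name is invented):

```python
# Sketch of importing a scheme from a delimited text file:
# comma delimiter, double-quote text qualifier, one row skipped.
import csv

def import_scheme(path):
    scheme = {}
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=",", quotechar='"')
        next(reader)                 # number of rows to skip: 1
        for row in reader:
            if len(row) >= 2:
                data, standard = row[0], row[1]
                scheme[data] = standard
    return scheme
```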
6. Save the scheme.
a. Select File  Save As to save the standardization scheme in the Quality Knowledge Base (QKB).
1) Click English (United States).
2) Click OK.
b. Type Ch3D10 Company Scheme (Text File) in the Name field.
c. Click Save.
7. Select File  Exit to close the Scheme Builder window.
Exercises
6. Creating a Scheme from the CONTACT_CNTRY Field.
Open the Ch3E3_dfConglomerateGrocery_Profile profile.
• Locate the CONTACT_CNTRY field in the MANUFACTURERS table.
• Verify that the distinct set of values can all be represented by USA. (Be sure to investigate the X
value to confirm this.)
• Create a scheme that translates all the values to USA.
• Save the scheme as Ch3E6 CONTACT_CNTRY Scheme.
7. Importing a Scheme from a Text File.
Import a scheme from the following text file:
S:\Workshop\dqdmp1\data\Text Files\Training_Exercise_Company_Scheme.txt
• Use the following specifications for the text file:
Text qualifier: " (double quotation mark)
Number of rows to skip: 1
Field Delimiter: Comma
• Save the new scheme as Ch3E7 Company Scheme (Text File).
3.5 Solutions to Exercises
1. Creating a Dates Collection
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Collections riser bar.
d. Click the Basics Exercises repository.
e. Click the New Collection button (or right-click Exercises and select New Collection).
1) Type Dates in the Name field.
2) Type Collection of Date Fields in the Description field.
3) Click OK.
f. Click the newly defined Dates collection in the Navigation pane.
g. Click Insert Fields in the information pane for the Dates collection.
The Insert Fields window appears.
h. Add date fields from the dfConglomerate Gifts data connection.
1) Click the button next to the Connection field.
2) Click dfConglomerate Gifts.
3) Click OK.
4) Add fields from the Order Details table.
a) Click the plus sign in front of the Order Details table.
b) Click the DATE ALLOCATED field. (Click the check box.)
c) Click Add.
d) Clear the selections for the Order Details table.
5) Add fields from the Orders table.
a) Click the plus sign in front of the Orders table.
b) Click the ORDER DATE field. (Click the check box.)
c) Click the PAID DATE field. (Click the check box.)
d) Click the SHIPPED DATE field. (Click the check box.)
e) Click Add.
i. Add date fields from the dfConglomerate Grocery data connection.
1) Click the button next to the Connection field.
2) Click dfConglomerate Grocery.
3) Click OK.
4) Add fields from the BREAKFAST_ITEMS table.
a) Click the plus sign in front of the BREAKFAST_ITEMS table.
b) Click the DATE field. (Click the check box.)
c) Click Add.
5) Add fields from the MANUFACTURERS table.
a) Click the plus sign in front of the MANUFACTURERS table.
b) Click the POSTDATE field. (Click the check box.)
c) Click Add.
6) Click Close to close the Insert Fields window.
The Dates collection should now resemble the following:
QUESTION: How many date fields were added to the collection?
Answer: Six (6)
2. Creating a Phone Numbers Collection
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Collections riser bar.
d. Click the Basics Exercises repository.
e. Click the New Collection button (or right-click Exercises and select New Collection).
1) Type Phone Numbers in the Name field.
2) Type Collection of Phone Number Fields in the Description field.
3) Click OK.
f. Click the newly defined Phone Numbers collection in the Navigation pane.
g. Click Insert Fields in the information pane for the Phone Numbers collection.
The Insert Fields window appears.
h. Add phone number fields from the dfConglomerate Gifts data connection.
1) Click the button next to the Connection field.
2) Click dfConglomerate Gifts.
3) Click OK.
4) Add fields from the Customers table.
a) Click the plus sign in front of the Customers table.
b) Click the BUSINESS PHONE field. (Click the check box.)
c) Click the FAX NUMBER field. (Click the check box.)
d) Click the HOME PHONE field. (Click the check box.)
e) Click the MOBILE PHONE field. (Click the check box.)
f) Click Add.
g) Clear the selections for the Customers table.
5) Add fields from the Employees table.
a) Click the plus sign in front of the Employees table.
b) Click the BUSINESS PHONE field. (Click the check box.)
c) Click the FAX NUMBER field. (Click the check box.)
d) Click the HOME PHONE field. (Click the check box.)
e) Click the MOBILE PHONE field. (Click the check box.)
f) Click Add.
g) Clear the selections for the Employees table.
i. Add phone number fields from the dfConglomerate Grocery data connection.
1) Click the button next to the Connection field.
2) Click dfConglomerate Grocery.
3) Click OK.
4) Add fields from the MANUFACTURERS table.
a) Click the plus sign in front of the MANUFACTURERS table.
b) Click the PHONE field.
c) Click the CONTACT_PHONE field. (Click the check box.)
d) Click Add.
5) Click Close to close the Insert Fields window.
The Phone Numbers collection should now resemble the following:
QUESTION: How many phone number fields were added to the collection?
Answer: Ten (10)
3. Profiling the Tables in dfConglomerate Grocery
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Click the Folders riser bar.
c. Click the Basics Exercises repository.
d. Click New  Profile.
1) Double-click the profiles_and_explorations folder. (This action makes this folder the value
of the Save in field.)
2) Type Ch3E3_dfConglomerateGrocery_Profile in the Name field.
3) Click OK.
e. Verify that the Properties tab is selected.
f. Click the dfConglomerate Grocery check box. An X appears in the check box.
g. Define profile options.
1) Select Tools  Default Profile Metrics.
2) Verify that all metrics are selected.
3) Click OK to close the Metrics window.
h. Select File  Save Profile to save the profile to this point.
i. Override metrics for two ID columns.
1) Click the plus sign in front of dfConglomerate Grocery.
2) Click the MANUFACTURERS table.
3) Click the ID field.
4) Right-click the ID column and select Override Metrics.
a) Clear the selection for Frequency distribution.
b) Clear the selection for Pattern frequency distribution.
c) Clear the selection for Percentiles.
d) Clear the selection for Outliers.
e) Click OK to close the Metrics window.
5) Click the BREAKFAST_ITEMS table.
6) Click the ID field.
7) Right-click the ID column and select Override Metrics.
a) Clear the selection for Frequency distribution.
b) Clear the selection for Pattern frequency distribution.
c) Clear the selection for Percentiles.
d) Clear the selection for Outliers.
e) Click OK to close the Metrics window.
j. Select File  Save Profile to save the profile.
k. Select Actions  Run Profile Report.
1) Type Initial profile in the Description field.
2) Click OK to close the Run Profile window.
The Report tab becomes active.
l. Review the Profile report.
QUESTION: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS
table?
Answer: Click the plus sign to expand the BREAKFAST_ITEMS table. Click the UOM field.
The Column Profiling tab shows the Unique Count metric to have a value
of 6 (six).
QUESTION: What are the distinct values for the UOM field in the BREAKFAST_ITEMS
table?
Answer: If necessary, click the plus sign to expand the BREAKFAST_ITEMS table, and then
click the UOM field. Click the Frequency Distribution tab to display the six
unique values (OZ, CT, LB, <null value>, PK, 0Z).
QUESTION: What is the ID field value for the records with PK as a value for the UOM field
in the BREAKFAST_ITEMS table?
Answer: If necessary, click the plus sign to expand the BREAKFAST_ITEMS table, and then
click the UOM field. Click the Frequency Distribution tab to display the six
unique values. Double-click the value PK. Notice that the ID values for these
two records are 556 and 859.
m. Filter for values of UOM with a frequency count greater than 5.
1) Verify that the Frequency Distribution tab is selected (for the selected UOM field).
2) Select View  Report Filter.
3) Under Filter condition, select Frequency count as the value for the Field type field.
4) Under Filter condition, select greater than as the condition.
5) Type 5 (five) in the Single value field.
6) Click Add Condition.
7) Click OK to close the Report Filter window.
The Frequency Distribution tab is updated.
n. Add a note for the UOM field.
1) Verify that the UOM field is selected.
2) Select Insert  New Note.
3) Type Standardize the UOM field in the Add Note window.
4) Click OK to close the Add Note window.
5) Verify that the new note appears on the Notes tab.
o. View the pattern frequency distribution for the PHONE field in the MANUFACTURERS table.
1) Click the plus sign in front of the MANUFACTURERS table.
2) Click the PHONE field.
3) Click the Pattern Frequency Distribution tab.
4) Double-click the pattern (999)999-9999.
The Pattern Frequency Distribution Drill Through window appears.
5) Click Close to close the Pattern Frequency Distribution Drill Through window.
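A pattern such as (999)999-9999 can be approximated by mapping each character to its character class. Below is a rough sketch (invented Python, assuming digits map to 9 and letters to A or a; DataFlux's actual pattern logic may differ in detail):

```python
# Sketch of deriving a character pattern from a value for pattern
# frequency analysis: digits -> 9, uppercase -> A, lowercase -> a,
# everything else kept as-is.
def char_pattern(value):
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)

print(char_pattern("(919)555-0123"))  # (999)999-9999
```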
p. Add a note for the PHONE field.
1) Verify that the PHONE field is selected.
2) Select Insert  New Note.
3) Type Standardize the PHONE field in the Add Note window.
4) Click OK to close the Add Note window.
5) Verify that the new note appears on the Notes tab.
q. Investigate data via a visualization.
1) Click the MANUFACTURERS table.
2) Click the Visualizations tab.
3) Click the button to the right of the Chart field.
4) Type Comparing Min and Max Lengths for all fields in the Description field.
5) Select Line as the value for the Chart type field.
6) Verify that the data type is set to Field metrics.
7) Click Select/unselect all under the Fields area. (This selects all fields.)
8) Click Minimum Length and Maximum Length as the choices under the Metrics area.
9) Click OK to close the Chart Properties window.
The Chart window appears.
r. Select File  Close Profile.
4. Profiling a Selection of Tables in a DataFlux Sample
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Click the Folders riser bar.
c. Click the Basics Exercises repository.
d. Click New  Profile.
1) Double-click the profiles_and_explorations folder. (This action makes this folder the value
of the Save in field.)
2) Type Ch3E4_DataFluxSample_Profile in the Name field.
3) Click OK.
e. Verify that the Properties tab is selected.
1) Click the plus sign in front of DataFlux Sample.
2) Click the check box in front of the Client_Info table. An X appears in the check box.
3) Click the check box in front of the Contacts table. An X appears in the check box.
4) Click the check box in front of the Product table. An X appears in the check box.
5) Click the check box in front of the Product_Sales table. An X appears in the check box.
f. Define profile options.
1) Select Tools  Default Profile Metrics.
2) If necessary, click Select/unselect all in the Column profiling area.
3) Verify that the top four metrics (Frequency distribution, Pattern frequency distribution,
Percentiles, and Outliers) are not selected.
4) Click OK to close the Metrics window.
g. Select File  Save Profile to save the profile to this point.
h. Override metrics for the Phone and State fields of the Client_Info and Contacts tables.
1) Click Client_Info table.
2) Click the Phone field.
3) Hold down the CTRL key and then click State.
4) Right-click the Phone column and select Override Metrics.
a) Click Frequency distribution.
b) Click Pattern frequency distribution.
c) Click OK to close the Metrics window.
5) Click the Contacts table.
6) Click the PHONE field.
7) Hold down the CTRL key and then click STATE.
8) Right-click the PHONE column and select Override Metrics.
a) Click Frequency distribution.
b) Click Pattern frequency distribution.
c) Click OK to close the Metrics window.
i. Override metrics for the WEIGHT_OZS field of the Product table.
1) Click Product table.
2) Right-click the WEIGHT_OZS column and select Override Metrics.
a) Click Frequency distribution.
b) Click Pattern frequency distribution.
c) Click Percentiles.
d) Click Outliers.
e) Click OK to close the Metrics window.
j. Select File  Save Profile to save the profile.
k. Select Actions  Run Profile Report.
1) Type Initial profile in the Description field.
2) Click OK to close the Run Profile window.
The Report tab becomes active.
l. Review the profile report.
QUESTION: How many distinct values exist for the STATE field in the Contacts table?
Answer:
(i) Click the plus sign to expand the Contacts table.
(ii) Click the STATE field.
The Column Profiling tab shows the Unique Count metric to have a value of 61.
QUESTION: How many distinct patterns exist for the STATE field in the Contacts table?
Answer:
(i) Click the plus sign to expand the Contacts table.
(ii) Click the STATE field.
The Column Profiling tab shows the Pattern Count metric to have a value of 21.
QUESTION: What does a bar chart graphic (visualization) tell you about the comparison of the
data length and maximum length metrics for all fields? (Do not include the fields
DATE, DELETE_FLG, ID, and MATCH_CD in the graphic view.)
Answer:
(i) Click the Contacts table.
(ii) Click the Visualizations tab.
(iii) Click the button to the right of the Chart field.
(iv) Type Comparing Data and Max Lengths for all fields in the Description field.
(v) Select Bar as the value for the Chart type field.
(vi) Verify that the data type is set to Field metrics.
(vii) Click Select/unselect all in the Fields area. (This selects all fields.)
(viii) Clear the selections for DELETE_FLG, ID, and MATCH_CD fields.
(ix) Click Data Length and Maximum Length as the choices in the Metrics area.
(x) Click OK to close the Chart Properties window.
The Visualizations tab now displays the following:
One field (ADDRESS) has a defined length much longer than the maximum
length. Several additional fields have defined data lengths longer than the
maximum length used.
QUESTION: What is the ID field value for the record(s) with “bad” PHONE field pattern in the
CONTACTS table?
Answer:
(i) Click the plus sign to expand the Contacts table.
(ii) Click the PHONE field.
(iii) Click the Pattern Frequency Distribution tab.
(iv) Double-click the pattern (999)999-9999.
(v) Click Close.
(vi) Select Insert  New Note.
(vii) Type Be sure to correct the PHONE field pattern.
(viii) Click OK.
m. Add a table-level note.
1) Click the Contacts table.
2) Select Insert  New Note.
3) Type Be sure to standardize the STATE and PHONE fields.
4) Click OK.
n. Select File  Close Profile.
5. Profiling a Text File
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Click the Folders riser bar.
c. Click the Basics Exercises repository.
d. Click New  Profile.
1) Double-click the profiles_and_explorations folder. (This action makes this folder the value
of the Save in field.)
2) Type Ch3E5_Manufacturer_TextFile_Profile in the Name field.
3) Click OK.
e. On the Properties tab, click Text Files.
f. Select Insert  New Text File.
1) Type Manufacturer_Contacts in the Table name field.
2) Click OK to close the Insert Text File window.
g. Specify delimited file information.
1) Click the browse button next to the Input file name field.
The Open window appears.
2) Navigate to S:\Workshop\dqdmp1\data\Text Files.
3) Click the Manufacturer_Contact_List.txt file.
4) Click Open.
5) Click the down arrow in the Text qualifier field and select " (double quotation mark).
6) Verify that the default delimiter is set to Comma.
7) Click Number of rows to skip and verify that the default value is set to one.
8) Click Suggest in the Fields area.
a) Click First row contains field names.
b) Click Guess field types and lengths from the file content.
c) Click OK.
9) Click Export in the Fields area.
a) Navigate to S:\Workshop\dqdmp1\data\Text Files.
b) Type Manufacturer_Contact_List.dfl in the File name field.
c) Click Save. (Replace the file if necessary.)
10) Click OK to close the Delimited File Information window.
The field information is returned to the profile. The fields here can be “processed” as you did
for a table source.
h. If necessary, select all fields by clicking the check box in front of Manufacturer_Contacts.
i. Select all metrics.
1) Select Tools  Default Profile Metrics.
2) Verify that all metrics are selected.
3) Click OK to close the Metrics window.
j. Select File  Save Profile to save the profile.
k. Select Actions  Run Profile Report.
1) Type Profiling a text file in the Description field.
2) Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
QUESTION: How many patterns exist for the WORK_STATE field?
1) If necessary, click the plus sign in front of Manufacturer_Contacts to view the available fields.
2) Click the WORK_STATE field.
Answer: Notice the Pattern Count metric (1).
QUESTION: How many values exist for the WORK_STATE field?
Answer: Notice the Unique Count metric (12).
QUESTION: What values exist for the WORK_STATE field?
QUESTION: What do the records with the WORK_STATE of CA look like?
Answer: Double-click the value CA.
QUESTION: What can be said about the comparison of the Pattern Count and Unique
Count metrics for all fields?
1) Click the text file Manufacturer_Contacts.
2) Click the Visualizations tab.
3) Click the button to the right of the Chart field.
a) Type Comparing Pattern and Unique Count metrics for all fields in the Description
field.
b) Select Bar as the value for the Chart type field.
c) Verify that the Data type is set to Field metrics.
d) Click Select/unselect all in the Fields area.
e) Click Pattern Count and Unique Count in the Metrics area.
f) Click OK.
l. Select File  Close Profile.
6. Creating a Scheme from the CONTACT_CNTRY Field
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Click the Folders riser bar.
c. Navigate to Basics Exercises  profiles_and_explorations.
d. Double-click Ch3E3_dfConglomerateGrocery_Profile.
The profile opens on a new tab.
e. If necessary, click the plus sign in front of MANUFACTURERS to view the available fields.
f. Click the CONTACT_CNTRY field.
g. Click the Frequency Distribution tab.
h. Double-click the X value.
i. Verify that the contact information should have a CONTACT_CNTRY value of USA.
j. Click Close.
k. From the Tables riser bar, click the MANUFACTURERS table.
l. Right-click the CONTACT_CNTRY field and select Build a Scheme.
m. Accept the defaults in the Report Generation window and click OK.
n. Double-click the value USA.
o. Click all three values in the report.
p. Click Add To Scheme.
q. Select File  Save As.
1) Type Ch3E6 CONTACT_CNTRY Scheme in the Name field.
2) Click Save.
r. Select File  Exit to close the Scheme Builder window.
s. Select File  Close Profile to close the profile tab.
7. Importing a Scheme from a Text File
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Select Tools  Other QKB Editors  Scheme Builder.
c. Click the English (United States) locale and click Open.
d. Select File  Import From Text File.
The Import From Text File window appears.
e. Specify the settings for the Import From Text File window.
1) Click the browse button next to the Import file name field.
a) Navigate to S:\Workshop\dqdmp1\data\Text Files.
b) Click the Training_Exercise_Company_Scheme.txt file.
c) Click Open.
2) Click the down arrow in the Text qualifier field and select " (double quotation mark).
3) Type 1 in the Number of rows to skip field.
4) Verify that Comma is selected as the value for the Field Delimiter.
5) Click OK.
f. Save the scheme.
1) Select File  Save As to save the standardization scheme in the QKB.
a) Click English (United States).
b) Click OK.
2) Type Ch3E7 Company Scheme (Text File) in the Name field.
3) Click Save.
g. Select File  Exit to close the Scheme Builder window.
Chapter 4 ACT
4.1
Introduction to Data Jobs .............................................................................................. 4-3
Demonstration: Setting DataFlux Data Management Studio Options ................................... 4-5
Demonstration: Creating and Running a Simple Data Job .................................................. 4-10
Exercises .............................................................................................................................. 4-33
4.2 Data Quality Jobs ......................................................................................................... 4-34
Demonstration: Investigating Standardizing, Parsing, and Casing ...................................... 4-38
Demonstration: Investigating Right Fielding and Identification Analysis ............................. 4-50
Exercises .............................................................................................................................. 4-61
Demonstration: Using a Standardization Definition and a Standardization Scheme ........... 4-63
4.3 Data Enrichment Jobs (Self-Study)............................................................................. 4-75
Demonstration: Performing Address Verification and Geocoding ........................................ 4-77
Exercises .............................................................................................................................. 4-88
4.4 Entity Resolution Jobs ................................................................................................. 4-90
Demonstration: Creating a Data Job to Cluster Records..................................................... 4-93
Exercises ............................................................................................................................ 4-108
Demonstration: Creating an Entity Resolution Job ............................................................ 4-112
Demonstration: Creating a Data Job to Compare Clusters (Optional) .............................. 4-138
4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) ...................................................... 4-147
Demonstration: Multi-Input/Multi-Output Data Job: New Products .................................... 4-149
Demonstration: Multi-Input/Multi-Output Data Job: Customer Matches ............................ 4-159
Exercises ............................................................................................................................ 4-166
4.1 Introduction to Data Jobs
Objectives
• Discuss and explore the basic DataFlux Data Management Studio options for affecting data jobs.
• Create and execute a simple data job.
What Is a Data Job?
Data jobs are the main way to process data in DataFlux Data Management Studio. Each data job specifies a set of data-processing operations that flow from source to target.
Data Job Options
This dialog box enables you to set global options for data jobs as well as other areas of DataFlux Data Management Studio.
Setting DataFlux Data Management Studio Options
In this demonstration, you investigate and set various DataFlux Data Management Studio options.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Select Tools  Data Management Studio Options.
The Data Management Studio Options window appears.
4. Verify that the General item is selected in the selection pane.
5. Clear the Automatically preview a selected node when the preview tab is active check box.
6. Click Job in the selection pane.
7. Click the Include node-specific notes when printing check box.
8. Click All in the Output Fields pane.
9. Click QKB in the selection pane.
10. If necessary, click the down arrow in the Default Locale field and select English (United States).
11. Click OK to save the change and close the Data Management Studio Options window.
Data Inputs and Data Outputs
Data jobs always contain a data input node and a data output node.
Use the nodes in the Data Inputs section to specify different types of input to a data job. Use the nodes in the Data Outputs section to specify different types of output from a data job. Data jobs can have more than one input node and output node.
Data Validation Node
The Data Validation node is used to filter or flag rows according to the specified condition(s).
The Data Validation node is in the Utilities grouping of nodes.
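Conceptually, the Data Validation node applies a condition to each row and, with the Remove row from output audit action, keeps only the rows that pass. A minimal Python sketch of that behavior follows; the sample data and field names are illustrative stand-ins, not the actual Products table.

```python
import csv
import io

# Illustrative sample resembling the Products table used later in this chapter.
PRODUCTS = """PRODUCT CODE,PRODUCT NAME,CATEGORY
P-001,Gift Wrap,Miscellaneous
P-002,Coffee Mug,Kitchen
P-003,Candle Set,Miscellaneous
"""

def validate_rows(rows, field, value):
    """Keep rows where field equals value; drop the rest
    (the 'Remove row from output' audit action)."""
    return [row for row in rows if row[field] == value]

rows = list(csv.DictReader(io.StringIO(PRODUCTS)))
kept = validate_rows(rows, "CATEGORY", "Miscellaneous")
print([r["PRODUCT CODE"] for r in kept])  # → ['P-001', 'P-003']
```

The node's other audit actions (for example, flagging rather than removing a row) would add a status field instead of filtering the list.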
Run and Preview Tools
The Run tool submits the job contained in the Data Flow Editor for processing and creates the output(s) specified in the job.
The Preview tool initiates a preview run and displays the results at the point of the selected node.
Note: A preview of a Data Output node does not show field name changes or deletions. This enables the flexibility to continue your data flow after a Data Output node. In addition, previewing a Data Output node does not create the output. You must run the data job to create the output.
Viewing Data Job Logs
Review the log after running a data job.
The Log tab displays the status of the last job run.
Creating and Running a Simple Data Job
In this demonstration, you create a data job that reads records from the Products table (in the
dfConglomerate Gifts data source). The data job filters the data, and then writes the filtered records
to a text file.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click New  Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D2_Products_Misc in the Name field.
c. Click OK.
The new data job appears on a tab.
New Data Job tab
Home tab
6. Add the Data Source node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. Click the expand icon in front of the Data Inputs grouping of nodes.
c. Double-click the Data Source node.
An instance of the node appears in the data flow. A properties window for the node appears.
Note: You can access the properties window for any node by right-clicking the node and selecting Properties.
d. Type Products Table in the Name field.
e. Click the selection button next to the Input table field.
1) Click the expand icon in front of the dfConglomerate Gifts data source.
2) Click Products.
3) Click OK to close the Select Table window.
Note: All the fields from the table are automatically selected because the Output Fields option was set to All in the Data Management Studio Options window.
Note: Fields can be deselected or selected for output from the node by using the right and left arrow buttons:
• Right arrow: Selects the highlighted field(s) for output from the node.
• Double right arrow: Selects all fields for output from the node.
• Left arrow: Removes the highlighted field(s) from the output of the node.
• Double left arrow: Removes all fields from the output of the node.
Note: The output names for the fields can be changed by double-clicking the field name and changing the name.
Note: The order of the output fields from the node can be controlled by highlighting field(s) and using the up and down arrow buttons:
• Up arrow: Moves the highlighted field(s) up in the output list.
• Down arrow: Moves the highlighted field(s) down in the output list.
f. Move the CATEGORY field so that it follows the PRODUCT NAME field.
1) Scroll in the Selected list box of Output fields and locate the CATEGORY field.
2) Click the CATEGORY field.
3) Click the up arrow seven (7) times so that CATEGORY appears after PRODUCT NAME.
g. Click OK to save the changes and close the Data Source Properties window.
The Data Source node appears in the job diagram and the updated information is displayed.
7. Edit the Basic Settings for the Data Source node, and then Preview the data.
a. If necessary, click View  Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Data Source as the value for the Description field.
d. Click the Preview tab.
e. Right-click the Data Source node and select Preview.
A sample of records appears on the Preview tab of the Details panel.
(The screenshot identifies the tool that toggles the Details pane on and off, the tool that previews the data from the selected node, and the Details pane with the Preview tab selected.)
Note: You must explicitly select Preview because you turned off the Auto Preview option.
8. Add a Data Validation node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Utilities grouping of nodes.
c. Locate the Data Validation node.
d. Double-click the Data Validation node.
An instance of the node appears in the data flow. The Data Validation Properties window appears.
9. Specify properties for the Data Validation node.
a. Type Filter for Miscellaneous in the Name field.
b. Click the down arrow next to Field and select CATEGORY.
c. Click the down arrow in the Operation field and select equal to.
d. Click Single (if necessary) and then type Miscellaneous.
e. Click Add Condition. The Expression area is updated.
f. In the Audit action area, verify that Fails is selected for Row validation and Remove row from output is selected as the action.
The final settings for the Data Validation node should resemble the following:
g. Click OK to close the Data Validation Properties window.
10. Establish an appropriate description and preview the Data Validation node.
a. If necessary, click the Data Validation node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Subset for Misc as the value for the Description field.
e. Click the Data Validation node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Verify that the CATEGORY field values are all Miscellaneous.
11. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Utilities grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Locate the Text File Output node.
d. Double-click the Text File Output node.
An instance of the node appears in the data flow. The Text File Output Properties window
appears.
12. Specify properties for the Text File Output node.
a. Type Miscellaneous Products in the Name field.
b. Specify the output file information.
1) Click the file selection button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D2_Products_Misc.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Specify only the desired fields of PRODUCT CODE and PRODUCT NAME.
1) Remove all default Selected fields by clicking the double left arrow.
2) Double-click PRODUCT CODE from the Available list box.
3) Double-click PRODUCT NAME from the Available list box.
e. Rename PRODUCT CODE for the output file.
1) Double-click PRODUCT CODE in the Output Name field in the Selected list box.
2) Type PRODUCT CD.
3) Press ENTER.
f. Add a note that states that a field name was changed.
1) Click Notes.
2) Type The PRODUCT CODE field was renamed to PRODUCT CD in the output file.
in the Notes window.
3) Click OK to close the Notes window.
The final settings for the Text File Output node should resemble the following:
g. Click OK to save the changes and close the Text File Output Properties window.
13. Establish an appropriate description and preview the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Text File as the value for the Description field.
e. Verify that the note entered for the node appears.
f. Click the Text File Output node in the data flow diagram.
g. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
h. Verify that the CATEGORY field values are all Miscellaneous.
Note: When you preview an output node, all fields are shown and any renames of output fields are not reflected in the preview. This is by design to enable you to write only certain records to an output and still have all the fields for the continuation of your data flow.
Note: Previewing does not create the output. Only when the job is run is the output physically created.
The data job should now resemble the following:
14. Verify that the added note is reflected on the Text File Output node.
a. Locate the Notes icon in the Text File Output node.
b. Click the note icon on the Text File Output node.
c. Click the close button to close the notes item.
d. Add a Data Target (Insert) node to the job flow.
1) In the Nodes resource pane, verify that the Data Outputs grouping of nodes is expanded.
2) Double-click the Data Target (Insert) node.
The node appears in the data flow. The Data Target (Insert) Properties window appears.
e. Specify properties for the Data Target (Insert) node.
1) Type Insert Records on Products_Misc Table in the Name field.
2) Click the selection button next to the Output table field. The Select Table window appears.
a) Click the expand icon in front of the dfConglomerate Gifts data source.
b) Click dfConglomerate Gifts.
c) Click the new table button to be able to add a table to the selected data connection.
d) Type Products_Misc in the Enter a name for the new table field.
e) Click OK to close the New Table window.
f) Click OK to close the Select Table window.
3) Click the Delete existing rows check box.
4) Accept the default selected output fields.
5) Click OK to save the changes and close the Data Target (Insert) Properties window.
15. Establish an appropriate description for the Data Target (Insert) node.
a. If necessary, click the Data Target (Insert) node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Data Target as the value for the Description field.
16. Add a note to the data flow.
a. Click the Insert Note tool.
b. Click and drag the note to position it under the nodes.
c. Type the following as text for the note:
This job:
1. Reads the records from the Products table
2. Filters for the records where CATEGORY=Miscellaneous
3. Writes the results of two fields (one renamed) to a text file
4. Writes the results of all fields to a new table in dfConglomerate Gifts
The data job should now resemble the following:
17. Group the two output nodes.
a. Click the Text File Output node (labeled Miscellaneous Products).
b. Hold down the CTRL key and click the Data Target (Insert) node (labeled Insert Records on Products_Misc Table).
c. Right-click one of the selected nodes and select Group Selected Nodes.
1) Type Output Nodes in the Name field.
2) Type Grouping the text file and data target nodes together in the Description field.
3) Click OK to close the Group Properties window.
The data job should now resemble the following:
Note: To undo the grouping, you can right-click the grouping element and select Ungroup Selected Groups.
18. Specify properties for the job.
a. Click the Settings tab.
b. Type Create and run a data job in the Description field.
c. Type The job writes Products with the CATEGORY of Miscellaneous to a text file (just two
columns), and writes all columns (same subset of data) to a new table. in the Notes field.
19. Select File  Save.
20. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The job runs and the requested text file output is displayed in a Notepad window.
c. Select File  Exit to close the Notepad window.
d. Notice the processing information on each node.
21. View the (full) Log information.
a. Click the Log tab.
b. Review the information for each of the nodes.
22. Investigate print options.
a. If necessary, click the Data Flow tab.
b. Right-click in the background of the data flow (not on a particular node) and select Print 
Print Setup.
1) Click Landscape in the Orientation area.
2) Click OK to close the Page Setup window.
Note: A tool on the Data Flow toolbar can be used as an alternative to right-clicking for some actions.
c. Right-click in the background of the data flow (not on a particular node) and select Print 
Print Preview.
The Print Preview window appears.
d. Scroll to the last page and verify that the node-specific note appears.
Note: Recall the option that was set to enable the printing of the node-specific notes.
e. Click the close button to close the Print Preview window.
f. Right-click in the background of the data flow (not on a particular node) and select Clear Run Results  Clear All Run Results.
g. Right-click in the background of the data flow (not on a particular node) and select Save
Diagram As Image  Entire Diagram.
h. Navigate to S:\Workshop\dqdmp1\demos\files\output_files and click Save. (Accept the default
name of the .png file.)
i. View the newly created .png file.
1) Access a Windows Explorer window.
2) Navigate to S:\Workshop\dqdmp1\demos\files\output_files.
3) Double-click the .png file.
Note: The node-specific notes are not “printed” when the diagram is saved as an image.
4) Select File  Exit to close the Windows Photo Viewer.
23. Select File  Close to close the data job.
24. Verify the location of the data job.
a. If necessary, click the Home tab.
b. If necessary, click the Folders riser bar.
c. Click the expand icon to expand the Basics Demos repository.
d. Click the batch_jobs folder.
e. Verify that the new data job (Ch4D2_Products_Misc) exists in this location.
Exercises
1. Creating a Simple Data Job
Create a data job that uses Data Source, Data Validation, and Text File Output nodes. The final job
flow should resemble the following:
• Create the data job in the Basics Exercises  batch_jobs folder.
• Name the data job Ch4E1_Breakfast_Items_OZ.
• Read from the BREAKFAST_ITEMS table in the dfConglomerate Grocery data connection.
• Filter for records where UOM is equal to OZ.
• Write the results to a text file named Ch4E1_Breakfast_Items_OZ.txt in S:\Workshop\dqdmp1\Exercises\files\output_files. Set up the file so that the data is comma-delimited and includes a header row. Also, display the file after the job is processed. Move the MANUFACTURER_ID field so that it follows the ID field.
• Add a note that explains the three nodes used in this job.
• Add an appropriate description and note to the properties/settings of the job.
• Save and run the data job.
QUESTION: How many records were selected in the filter and therefore written to the text file?
Answer:
QUESTION: How many records were read from the source table?
Answer:
4.2 Data Quality Jobs
Objectives
• Explore applying standardization schemes versus standardization definitions.
• Explore parsing.
• Explore casing.
• Explore identification analysis versus right-fielding.
Standardization Scheme
A scheme is a simple find-and-replace action based on the information in the scheme file.
Partial Scheme:
Data      Standard
Street    St
St.       St
ST.       St
Rd.       Rd
Road      Rd
RD.       Rd

Original Data                        Standardized Data
123 North Main Street, Suite 100     123 North Main St, Suite 100
4591 S. Covington Road               4591 S. Covington Rd
29 WASSAU ST.                        29 Wassau St
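The find-and-replace behavior of a scheme can be sketched outside the Studio. The sketch below is a simplified illustration, not the QKB implementation: it applies the slide's partial scheme element by element and ignores details such as punctuation and match casing that real schemes handle.

```python
# The slide's partial scheme as a lookup table: data value -> standard.
SCHEME = {
    "Street": "St", "St.": "St", "ST.": "St",
    "Rd.": "Rd", "Road": "Rd", "RD.": "Rd",
}

def apply_scheme(text, scheme):
    """Replace each whitespace-delimited element that appears in the
    scheme with its standard; leave all other elements unchanged."""
    return " ".join(scheme.get(element, element) for element in text.split())

print(apply_scheme("4591 S. Covington Road", SCHEME))  # → 4591 S. Covington Rd
```

Because the lookup is exact, an element such as "Street," (with trailing punctuation) would not match here; the Scheme Builder's matching is more forgiving than this sketch.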
Standardization Definition
A standardization definition
• is more complex than a standardization scheme
• involves standardization scheme(s)
• can also parse data and apply regular expression libraries and casing information.

Data Prior to Standardization       Definition      Data After Standardization
Mister John Q. Smith, Junior        Name            Mr John Q Smith, Jr
dataflux corporation                Organization    DataFlux Corp
123 North Main Street, Suite 100    Address         123 N Main St, Ste 100
U.S.                                Country         UNITED STATES
9194473000                          Phone           (919) 447 3000
Note: If you standardize a data value using both a definition and a scheme, the definition is applied first and then the scheme is applied.
Note: Data standardization does not perform a validation of the data (for example, Address Verification). Address verification is a separate component of the DataFlux Data Management Studio application. (It is discussed in another chapter.)
Parsing Example
Address Information: 123 N MAIN ST APT 201
Parsed Information:
Street Number             123
Pre-direction             N
Street Name               MAIN
Street Type               ST
Post-direction
Address Extension         APT
Address Extension Number  201
Parsing separates multi-part field values into multiple, single-part fields. For example, if you have a
Name field that includes the value Mr. John Q. Smith III, Esq., you can use parsing to create six separate
fields:
Name Prefix:     Mr.
Given Name:      John
Middle Name:     Q.
Family Name:     Smith
Name Suffix:     III
Name Appendage:  Esq.
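To make the idea concrete, here is a rough Python sketch that splits an e-mail address into the Mailbox, Sub-Domain, and Top-Level Domain tokens used later in this chapter. It is a simplified stand-in for a QKB parse definition, which relies on grammars and vocabularies rather than a single regular expression; the sample address is invented.

```python
import re

def parse_email(address):
    """Split an address into Mailbox, Sub-Domain, and Top-Level Domain
    tokens. One regular expression is a rough stand-in for a parse
    definition: everything before @, the domain up to the last dot,
    and whatever follows the last dot."""
    m = re.match(r"([^@]+)@(.+)\.([^.]+)$", address)
    if not m:
        return {"Mailbox": None, "Sub-Domain": None, "Top-Level Domain": None}
    return {
        "Mailbox": m.group(1),
        "Sub-Domain": m.group(2),
        "Top-Level Domain": m.group(3),
    }

print(parse_email("john.smith@dataflux.com"))
# → {'Mailbox': 'john.smith', 'Sub-Domain': 'dataflux', 'Top-Level Domain': 'com'}
```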
Casing Example
Input Data: Dataflux corporation
Data After Casing:
Upper                  DATAFLUX CORPORATION
Lower                  dataflux corporation
Proper                 Dataflux Corporation
Proper(Organization)   DataFlux Corporation
For best results, if available, select an applicable definition when you use Proper casing.
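The four casing operations can be sketched as follows. The small exception table is an illustrative assumption that mimics what the Proper(Organization) definition knows about multi-case names such as DataFlux; the real definition draws on QKB vocabularies.

```python
# Illustrative exception table standing in for the Proper(Organization)
# casing definition's knowledge of multi-case organization names.
ORG_CASING = {"dataflux": "DataFlux"}

def proper_case(text, exceptions=None):
    words = []
    for word in text.split():
        lower = word.lower()
        if exceptions and lower in exceptions:
            words.append(exceptions[lower])   # definition-aware casing
        else:
            words.append(lower.capitalize())  # plain proper casing
    return " ".join(words)

src = "Dataflux corporation"
print(src.upper())                   # → DATAFLUX CORPORATION
print(src.lower())                   # → dataflux corporation
print(proper_case(src))              # → Dataflux Corporation
print(proper_case(src, ORG_CASING))  # → DataFlux Corporation
```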
Identification Analysis versus Right Fielding

Identification Analysis:
Customer           Identification
DataFlux           Organization
John Q Smith       Individual
DataFlux Corp      Organization
Nancy Jones        Individual
Bob Brown & Sons   Organization

Right Fielding:
Customer           Name           Company
DataFlux                          DataFlux
John Q Smith       John Q Smith
DataFlux Corp                     DataFlux Corp
Nancy Jones        Nancy Jones
Bob Brown & Sons                  Bob Brown & Sons
Identification analysis and right fielding use the same listing of definitions from the QKB, but in different
ways. Identification analysis identifies the type of data in a field and right fielding moves the data into
fields based on its identification.
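The distinction can be sketched in a few lines of Python. The cue-word list below is a crude, illustrative stand-in for a QKB identification definition: `identify` performs identification analysis (labeling a value in place), and `right_field` uses the same identification to move each value into a Name or Company field.

```python
# Crude stand-in for a QKB identification definition: treat a value as an
# organization if it contains a corporate cue, otherwise as an individual.
ORG_CUES = {"corp", "corporation", "inc", "sons", "dataflux", "&"}

def identify(value):
    """Identification analysis: label the value, leaving it where it is."""
    tokens = {t.lower().strip(".,") for t in value.split()}
    return "Organization" if tokens & ORG_CUES else "Individual"

def right_field(values):
    """Right fielding: move each value into a field based on its identification."""
    out = {"Name": [], "Company": []}
    for v in values:
        out["Name" if identify(v) == "Individual" else "Company"].append(v)
    return out

customers = ["DataFlux", "John Q Smith", "DataFlux Corp",
             "Nancy Jones", "Bob Brown & Sons"]
print([(c, identify(c)) for c in customers])
print(right_field(customers))
# → {'Name': ['John Q Smith', 'Nancy Jones'],
#    'Company': ['DataFlux', 'DataFlux Corp', 'Bob Brown & Sons']}
```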
Investigating Standardizing, Parsing, and Casing
In this demonstration, you create a data job that standardizes selected fields from the Customers table.
You parse and case the e-mail information and write the results to a text file.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click New  Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D3_Customers_DataQuality in the Name field.
c. Click OK.
The new data job appears on a tab.
6. Add the Data Source node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. Click the expand icon in front of the Data Inputs grouping of nodes.
c. Double-click the Data Source node.
An instance of the node appears in the data flow, and the properties window for the node appears.
d. Type Customers Table in the Name field.
e. Click the selection button next to the Input table field.
1) Click the expand icon in front of the dfConglomerate Gifts data source.
2) Click Customers.
3) Click OK to close the Select Table window.
The final settings for the Data Source Properties should resemble the following:
f. Click OK to save changes and close the Data Source Properties window.
7. Establish an appropriate description and preview the Data Source node.
a. If necessary, click the Data Source node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Data Source as the value for the Description field.
e. Click the Data Source node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
8. Add a Standardization node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Quality grouping of nodes.
c. Locate the Standardization node.
d. Double-click the Standardization node.
The node appears in the data flow. The Standardization Properties window appears.
9. Specify properties for the Standardization node.
a. Type Standardize Fields in the Name field.
b. Double-click each of the following fields to move them from the Available list box to the Selected
list box.
COMPANY
BUSINESS PHONE
HOME PHONE
MOBILE PHONE
FAX NUMBER
ADDRESS
STATE/PROVINCE
ZIP/POSTAL CODE
COUNTRY/REGION
c. For the COMPANY field, click the down arrow under Scheme and select Ch3D8 Company Phrase Scheme.
d. For the BUSINESS PHONE field, click the down arrow under Definition and select Phone.
e. For the HOME PHONE field, click the down arrow under Definition and select Phone.
f. For the MOBILE PHONE field, click the down arrow under Definition and select Phone.
g. For the FAX NUMBER field, click the down arrow under Definition and select Phone.
h. For the ADDRESS field, click the down arrow under Definition and select Address.
i. For the STATE/PROVINCE field, click the down arrow under Definition and select State/Province(Abbreviation).
j. For the ZIP/POSTAL CODE field, click the down arrow under Definition and select Postal Code.
k. For the COUNTRY/REGION field, click the down arrow under Definition and select Country.
l. Click the Preserve null values check box.
m. Click the Add standardization flag field check box.
The Standardization Properties window should resemble the following:
Note: The standardization for a field can use a standardization definition and/or a scheme. If both are specified, the definition is applied first. Then the scheme is applied on the results from the definition.
Note: Selecting the Preserve null values option ensures that if a field is null when it enters the node, then the field is null after the node. Selecting this option is recommended if the output will be written to a database table.
n. Click OK to close the Standardization Properties window.
10. Establish an appropriate description and preview the Standardization node.
a. If necessary, click the Standardization node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Standardization as the value for the Description field.
e. Click the Standardization node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the _Stnd and _Stnd_flag fields.
11. Add a Parsing node to the job flow.
a. In the Nodes resource pane, verify that the Quality grouping of nodes is expanded.
b. Double-click the Parsing node.
The node appears in the data flow. The Parsing Properties window appears.
12. Specify properties for the Parsing node.
a. Type Parse Email in the Name field.
b. Click the down arrow under Field to parse and select EMAIL.
c. Click the down arrow under Definition and select E-mail.
d. Select the Mailbox, Sub-Domain, and Top-Level Domain tokens.
e. Click Preserve null values.
The Parsing Properties window should resemble the following:
f. Click OK to close the Parsing Properties window.
13. Establish an appropriate description and preview the Parsing node.
a. If necessary, click the Parsing node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Parsing as the value for the Description field.
e. Click the Parsing node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the Mailbox, Sub-Domain, and Top-Level Domain fields.
14. Add a Change Case node to the job flow.
a. In the Nodes resource pane, verify that the Quality grouping of nodes is expanded.
b. Double-click the Change Case node.
The node appears in the data flow. The Change Case Properties window appears.
c. Type Upper Case EMAIL Fields in the Name field.
d. Double-click each of the following fields to move them from the Available list box to the Selected
list box.
EMAIL
Mailbox
Sub-Domain
Top-Level Domain
e. For each of the four selected fields, click the down arrow under Type and select Upper.
f. For each of the four selected fields, click the down arrow under Definition and select Upper.
g. Accept the default output names.
h. Click the Preserve null values check box.
The Change Case Properties window should resemble the following:
i. Click OK to close the Change Case Properties window.
15. Establish an appropriate description and preview the Change Case node.
a. If necessary, click the Change Case node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Change Case as the value for the Description field.
e. Click the Change Case node in the data flow diagram.
f. Click the Preview tool.
g. Scroll to the right to view the new cased e-mail token fields.
16. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Quality grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Text File Output node.
An instance of the node appears in the data flow. The Text File Output Properties window
appears.
17. Specify properties for the Text File Output node.
a. Type Customer Info in the Name field.
b. Specify the output file information.
1) Click the file selection button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D3_CustomerInfo.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click the Include header row check box.
4) Click the Display file after job runs check box.
d. Accept all fields as selected fields for the output file.
e. Click OK to save the changes. Close the Text File Output Properties window.
18. Establish an appropriate description and preview the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Text File as the value for the Description field.
19. Select File  Save.
20. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
c. Verify that the text file appears.
d. Select File  Exit to close the Notepad window.
e. Notice the processing information on each node.
21. View the log information.
a. Click the Log tab.
b. Review the information for each of the nodes.
22. Click the Data Flow tab.
23. Select File  Close.
Investigating Right Fielding and Identification Analysis
In this demonstration, you examine the difference between identification analysis and right fielding.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click New  Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D4_RightFielding_IDAnalysis in the Name field.
c. Click OK.
The new data job appears on a tab.
6. Add the Text File Input node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. If necessary, click the collapse icon to collapse the Data Outputs grouping of nodes.
c. Click the expand icon in front of the Data Inputs grouping of nodes.
d. Double-click the Text File Input node.
An instance of the node appears in the data flow. The properties window for the node appears.
7. Specify properties of the Text File Input node.
a. Type Prospect List in the Name field.
b. Click the browse button next to the Input file field.
1) Navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click Prospect_List.txt.
3) Click Open to close the Open window.
c. Select “ (double quotation mark) for the Text qualifier field.
d. Verify that Comma is the value for the Field delimiter field.
e. Click Number of rows to skip.
f. Verify that 1 (one) is the default value for the number of rows to skip.
g. Click Import under the Fields area.
1) If necessary, navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click Prospect_List.dfl.
3) Click Open.
The final settings for the Text File Input Properties window should resemble the following:
h. Click OK to save the changes and close the Text File Input Properties window.
The Text File Input node appears in the job diagram with updated information.
24. Establish an appropriate description and preview the Text File Input node.
a. If necessary, click the Text File Input node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Text File Input as the value for the Description field.
e. Click the Text File Input node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
The Contact field has a mixture of individual names and corporate names. Right fielding and identification analysis both use identification definitions to help correctly identify the information found in a field.
8. Add a Right Fielding node to the job flow.
a. In the Nodes resource pane, collapse the Data Inputs grouping of nodes.
b. Expand the Quality grouping of nodes.
c. Double-click the Right Fielding node.
The node appears in the data flow. The Right Fielding Properties window appears.
9. Specify the properties of the Right Fielding node.
a. Type Right Field Contact Info in the Name field.
b. Click the down arrow for the Definition field and select Individual/Organization.
c. Under the Input fields area, double-click the Contact field to move it from the Available list box
to the Selected list box.
d. Under the Output fields area, add three new fields.
1) Click Add.
2) Click the down arrow under Identity and click Unknown.
3) Double-click the Field0 value under Output Name.
4) Type Unknown and press ENTER.
5) Click Add.
6) Click the down arrow under Identity and click Organization.
7) Double-click the Field1 value under Output Name.
8) Type Company and press ENTER.
9) Click Add.
10) Click the down arrow under Identity and click Individual.
11) Double-click the Field2 value under Output Name.
12) Type Person and press ENTER.
e. Click Preserve null values.
The final settings for the Right Fielding node should resemble the following:
f. Click Additional Outputs.
1) Click the Contact field under the Output fields area.
2) Click the down arrow multiple times to move the Contact field to the bottom of the list.
The ordering of the fields should be similar to the following:
 The Contact field is moved to the end of the list so that it can be more easily compared to the results of the Right Fielding node.
3) Click OK to close the Additional Outputs window.
g. Click OK to close the Right Fielding Properties window.
10. Establish an appropriate description and preview the Right Fielding node.
a. If necessary, click the Right Fielding node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Right Fielding as the value for the Description field.
e. Click the Right Fielding node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the new column information.
 The Right Fielding node correctly identified data values from the Contact field as Company (Organization), Person (Individual), or Unknown values.
11. Add an Identification Analysis node to the job flow.
a. Verify that the Quality grouping of nodes is expanded.
b. Double-click the Identification Analysis node.
The node appears in the data flow. The Identification Analysis Properties window appears.
12. Specify the properties for the Identification Analysis node.
a. Type Identify Contact Info in the Name field.
b. Click Contact in the Available list box in the Identification analysis fields area.
c. Click the right arrow to move Contact to the Selected list box.
d. Specify the correct identification definition for the selected Contact field.
1) Click the down arrow under Definition for the Contact field.
2) Select Individual/Organization.
3) Verify that Contact_Identity is the value for the Output Name field.
e. Click Preserve null values.
The final settings for the Identification Analysis node should resemble the following:
f. Click OK to close the Identification Analysis Properties window.
13. Establish an appropriate description and preview the Identification Analysis node.
a. If necessary, click the Identification Analysis node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Identification Analysis as the value for the Description field.
e. Click the Identification Analysis node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the new column information.
The Identification Analysis node also correctly identified data values from the Contact field as
either Organization or Individual values. Right fielding moves the data to an appropriate field, but
identification analysis creates a field with values to indicate whether each observation is an
INDIVIDUAL or ORGANIZATION value.
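The contrast between the two nodes can be sketched in a few lines of Python. This is a conceptual illustration only, not DataFlux code: the identity decision is reduced here to a keyword lookup, whereas the real Individual/Organization definition relies on licensed parse and vocabulary data.

```python
# Hypothetical identity test: the real definition is far more sophisticated.
ORG_HINTS = {"inc", "corp", "llc", "ltd", "company", "co"}

def identify(value):
    """Return INDIVIDUAL, ORGANIZATION, or UNKNOWN for a contact string."""
    if not value or not value.strip():
        return "UNKNOWN"
    words = {w.strip(".,").lower() for w in value.split()}
    return "ORGANIZATION" if words & ORG_HINTS else "INDIVIDUAL"

def right_field(value):
    """Right fielding MOVES the value into the field matching its identity."""
    row = {"Person": None, "Company": None, "Unknown": None}
    key = {"INDIVIDUAL": "Person", "ORGANIZATION": "Company",
           "UNKNOWN": "Unknown"}[identify(value)]
    row[key] = value
    return row

def identification_analysis(value):
    """Identification analysis KEEPS the value and adds a label field."""
    return {"Contact": value, "Contact_Identity": identify(value)}
```

For example, right_field("Acme Corp") leaves Person empty and fills Company, while identification_analysis("John Smith") keeps the Contact value and adds Contact_Identity = INDIVIDUAL.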
14. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, collapse the Quality grouping of nodes.
b. Expand the Data Outputs grouping of nodes.
c. Double-click the Text File Output node.
An instance of the node appears in the data flow. The Text File Output Properties window
appears.
15. Specify properties for the Text File Output node.
a. Type Prospects RF and ID in the Name field.
b. Specify the output file information.
1) Click the browse button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D4_Prospects_RF_ID.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click the Include header row check box.
4) Click the Display file after job runs check box.
d. Accept all fields as selected fields for the output file.
e. Click OK to save the changes and close the Text File Output Properties window.
16. Establish an appropriate description for the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Text File as the value for the Description field.
17. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
18. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
c. Verify that the text file appears.
d. Select File  Exit to close the Notepad window.
e. Notice the processing information on each node.
19. View the log information.
a. Click the Log tab.
b. Review the information for each of the nodes.
20. Select File  Close.
Exercises
2. Creating a Data Job that Involves Standardization and Parsing
Create a data job that uses the Standardization, Parsing, and Data Target (Insert) nodes. The final job
flow should resemble the following:
• Create a new data job named Ch4E2_Manufacturers_DataQuality in the batch_jobs folder of the Basics Exercises repository.
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data source. Select the following fields:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
• Standardize the following fields. Accept the default names for the standardized fields.

Field Name            Definition                       Scheme
MANUFACTURER                                           Ch3D8 Company Phrase Scheme
CONTACT               Name
CONTACT_ADDRESS       Address
CONTACT_STATE_PROV    State/Province (Abbreviation)
CONTACT_CNTRY                                          Ch3E6 CONTACT_CNTRY Scheme
4-62
Chapter 4 ACT
• Parse the standardized CONTACT field using the Name parse definition. Be sure to preserve null values. Select only three tokens and rename the output fields according to the following:

Token Name     Output Name
Given Name     FIRST_NAME
Middle Name    MIDDLE_NAME
Family Name    LAST_NAME

• Output the standardized, parsed data to a new table named MANUFACTURERS_STND (in the dfConglomerate Grocery data connection). If the data job runs multiple times, ensure that the records for each run are the only records in the table. In addition, only output the following fields with output names:
Field Name                 Output Name
ID                         ID
MANUFACTURER_Stnd          MANUFACTURER
FIRST_NAME                 FIRST_NAME
MIDDLE_NAME                MIDDLE_NAME
LAST_NAME                  LAST_NAME
CONTACT_ADDRESS_Stnd       CONTACT_ADDRESS
CONTACT_CITY               CONTACT_CITY
CONTACT_STATE_PROV_Stnd    CONTACT_STATE_PROV
CONTACT_POSTAL_CD          CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd         CONTACT_CNTRY
CONTACT_PHONE              CONTACT_PHONE
POSTDATE                   POSTDATE
• Save and run the data job.
• Review the log and close the data job.
• Verify that the records were written to the MANUFACTURERS_STND table in the dfConglomerate Grocery data connection.
Using a Standardization Definition and a Standardization
Scheme
In this demonstration, you create a data job to standardize a text file and output the standardized information to another text file. The output text file is used as a source for a profile. The profile reveals that the standardization of a State field does not "correct" all values. A scheme is created to cover the remaining problem values. The initial data job is updated to use the scheme in addition to the original definition. The resulting text file is profiled again and reveals the proper correction of the State field.
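The repair pattern used in this demonstration (a standardization definition first, then a scheme for the leftovers) can be sketched as follows. This is an illustration, not DataFlux internals: the dictionary contents are invented stand-ins for the licensed definition data.

```python
# Sketch only: a "definition" handles the general standardization cases,
# and a "scheme" is a plain lookup table, built from profiled values,
# that covers stragglers the definition misses (such as PENSILVANEYA).

STATE_DEFINITION = {"north carolina": "NC", "pennsylvania": "PA"}

# Built in the Scheme Builder from the profile's frequency distribution.
PA_SCHEME = {"PENSILVANEYA": "PA"}

def standardize_state(value, scheme=None):
    """Apply the definition, then optionally apply a scheme on the result."""
    out = STATE_DEFINITION.get(value.strip().lower(), value.strip().upper())
    if scheme is not None:
        out = scheme.get(out, out)
    return out
```

Without the scheme, standardize_state("Pensilvaneya") passes through unchanged as PENSILVANEYA; with PA_SCHEME it becomes PA, which is why the second profile run shows a lower unique count for the standardized field.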
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click New  Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D5_StandardizationDefinitionAndScheme in the Name field.
c. Click OK.
The new data job appears on a tab.
6. Add the Text File Input node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the Resource pane.
b. If necessary, collapse the Data Outputs grouping of nodes.
c. Expand the Data Inputs grouping of nodes.
d. Double-click the Text File Input node.
An instance of the node appears in the data flow and the properties window for the node appears.
7. Specify the properties of the Text File Input node.
a. Type Prospective Customers in the Name field.
b. Click the browse button next to the Input file field.
1) Navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click Prospect_List.txt.
3) Click Open.
c. Select “ (double quotation mark) as the value for the Text qualifier field.
d. Verify that Comma is the value for the Field delimiter field.
e. Click Number of rows to skip.
f. Verify that 1 (one) is the default value for the number of rows to skip.
g. Click Import under the Fields area.
1) If necessary, navigate to S:\Workshop\dqdmp1\data\Text Files.
2) Click Prospect_List.dfl.
3) Click Open.
h. Click OK to close the Text File Input Properties window.
8. Edit the Basic Settings for the Text File Input node, and then Preview the data.
a. If necessary, click View  Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Text File Input as the value for the Description field.
d. Click the Preview tab.
e. Right-click the Text File Input node and select Preview.
9. Add a Standardization node to the data flow.
a. In the Nodes resource pane, collapse the Data Inputs grouping of nodes.
b. Expand the Quality grouping of nodes.
c. Double-click the Standardization node.
The node appears in the data flow. The Standardization Properties window appears.
10. Specify the properties for the Standardization node.
a. Type Standardize State in the Name field.
b. Double-click the State field to move it from the Available list box to the Selected list box.
c. For the State field, click the down arrow under Definition and select State/Province (Abbreviation).
d. Click the Preserve null values check box.
The Standardization Properties window should resemble the following:
e. Click OK to close the Standardization Properties window.
11. Edit the Basic Settings for the Standardization node.
a. If necessary, click View  Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Standardization as the value for the Description field.
12. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, collapse the Quality grouping of nodes.
b. Expand the Data Outputs grouping of nodes.
c. Double-click the Text File Output node.
An instance of the node appears in the data flow. The Text File Output Properties window
appears.
13. Specify the properties for the Text File Output node.
a. Type Customer Info – Standard State in the Name field.
b. Specify the output file information.
1) Click the browse button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D5_Customer_StdState.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Accept all fields as selected fields for the output file.
e. Export the field layout.
1) Click Export.
2) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D5_Customer_StdState.dfl in the File name field.
4) Click Save.
f. Click OK to save the changes and close the Text File Output Properties window.
14. Edit the Basic Settings for the Text File Output node.
a. If necessary, click View  Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Text File Output as the value for the Description field.
The data job should resemble the following:
15. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
16. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
c. Verify that the text file appears.
d. Select File  Exit to close the Notepad window.
e. Notice the processing information on each node.
Profiling State and State_Stnd and Creating a Scheme
1. Click the Home tab. If necessary, click the Folders riser bar.
2. Click the Basics Demos repository.
3. Click New  Profile.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of
the Save in field.)
b. Type Ch4D5_TextFile_Profile in the Name field.
c. Click OK.
The new profile appears on a tab.
4. On the Properties tab, click Text Files.
5. Select Insert  New Text File.
a. Type Customer Info in the Table name field.
b. Click OK to close the Insert Text File window.
6. Specify delimited file information.
a. Click the browse button next to the Input file name field.
The Open window appears.
1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Click the Ch4D5_Customer_StdState.txt file.
3) Click Open.
The path and filename are returned to the Input file name field.
b. Click Import below the Fields area.
1) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Click the Ch4D5_Customer_StdState.dfl file.
3) Click Open.
c. Click Number of rows to skip and verify that the default value is set to 1 (one).
d. Click OK to close the Delimited File Information window.
The fields area is populated with information about the fields in the text file.
7. Click the State check box.
8. Click the State_Stnd check box.
9. Select all metrics.
a. Select Tools  Default Profile Metrics.
b. Verify that all metrics are selected.
c. Click OK to close the Metrics window.
10. Select File  Save Profile to save the profile.
11. Select Actions  Run Profile Report.
a. Type Profiling State fields in the Description field.
b. Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
12. Review the Profile report.
a. Expand Customer Info to view the available fields.
b. Click the State field.
c. If necessary, click the Column Profiling tab.
Notice the Unique Count metric.
d. Click the Frequency Distribution tab.
e. Click the State_Stnd field.
f. Click the Column Profiling tab.
Notice the Pattern Count and Unique Count metrics.
g. Click the Frequency Distribution tab.
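The two metrics highlighted in this report can be approximated conceptually. In this hedged sketch (the exact DataFlux pattern notation may differ, for example in how it treats case), letters map to A, digits map to 9, and other characters pass through:

```python
def pattern(value):
    """Reduce a value to its character pattern: letters -> A, digits -> 9."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c
                   for c in value)

def profile_metrics(values):
    """Unique Count counts distinct values; Pattern Count counts distinct patterns."""
    return {"unique_count": len(set(values)),
            "pattern_count": len({pattern(v) for v in values})}
```

For the values NC, PA, N.C., and 27513, the unique count is 4 but the pattern count is only 3 (AA, A.A., 99999). Standardizing a field collapses variant spellings, which is why both metrics shrink between the first and second profile runs.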
13. Create a scheme to “cover” the value PENSILVANEYA.
a. Click the Report tab.
b. Click the Customer Info text file.
c. If necessary, click the Standard Metrics tab.
d. On the Standard Metrics tab, click the State_Stnd field.
e. Select Actions  Build a Scheme.
The Report Generation window appears.
f. Click OK to close the Report Generation window.
The Scheme Builder window appears.
g. At the bottom of the Report side, type PA in the Standard field.
h. Click the value PENSILVANEYA.
i. Click Add To Scheme.
j. Select File  Save.
1) Type Ch4D5 State_PA_Scheme in the Name field.
2) Click Save.
k. Select File  Exit to close the Scheme Builder window.
Editing and Re-executing the Data Job and the Profile
1. Click the Ch4D5_StandardizationDefinitionAndScheme tab.
2. Edit the Standardize State node.
a. Right-click the Standardize State node and select Properties.
b. For the State field, click the down arrow under Scheme and select Ch4D5 State_PA_Scheme.
c. Click OK to save the updated standardization properties.
3. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
4. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
c. Verify that the text file appears.
d. Select File  Exit to close the Notepad window.
5. Re-run the profile.
a. Click the Ch4D5_TextFile_Profile tab.
b. Select Actions  Run Profile Report.
1) Type Profiling Updated State field in the Description field.
2) Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
6. Review the Profile report.
a. Expand Customer Info to view the available fields.
b. Click the State_Stnd field.
c. If necessary, click the Column Profiling tab.
Notice the Pattern Count and Unique Count metrics.
d. Click the Frequency Distribution tab.
7. Select File  Close Profile.
8. Select File  Close (to close the data job).
4.3 Data Enrichment Jobs (Self-Study)
Objectives
 Explore address verification and geocoding.
Address Verification

Original Address:
940 Cary Parkway
27513

Verified Address Information:
Street Address            940 NW CARY PKWY
City                      CARY
State                     NC
Zip                       27513
Zip+4                     2792
County Name               WAKE
Congressional District    4
Address verification identifies, corrects, and enhances address information. Data files can be licensed to verify address information for U.S., Canadian, and international addresses.
Geocoding

Original Address:
940 NW CARY PKWY
CARY, NC 27513-2792

Verified Geocode Information:
Latitude     35.811753
Longitude    -78.802326
Geocoding latitude and longitude information can be used to map locations and plan efficient delivery
routes. Geocoding can be licensed to return this information for the centroid of the postal code or at the
roof-top level.

Currently, there are only geocoding data files for the United States and Canada. Also, currently,
roof-top level geocoding is available only for the United States.
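Because the notes mention mapping and route planning, here is a small illustration (independent of DataFlux) of what geocodes enable: the great-circle distance between two latitude/longitude points, computed with the haversine formula.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

For example, using the geocode shown above (35.811753, -78.802326) and a second geocoded delivery stop, the returned mileage could feed a routing step; centroid-level geocodes make this approximate, while roof-top geocodes tighten it.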
Performing Address Verification and Geocoding
In this demonstration, you use an Address Verification node to verify the addresses present in the
Customers table from dfConglomerate Gifts. The resultant verified ZIP field is used in a Geocoding
node to produce additional desired geocode columns. This final result set is written to a text file.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click New  Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D6_AddressVerification_Example in the Name field.
c. Click OK.
The new data job appears on a tab.
6. Add the Data Source node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the Resource pane.
b. Expand the Data Inputs grouping of nodes.
c. Double-click the Data Source node.
An instance of the node appears in the data flow. The properties window for the node appears.
d. Type Customers Table in the Name field.
e. Click the browse button next to the Input table field.
1) Expand the dfConglomerate Gifts data source.
2) Click Customers.
3) Click OK to close the Select Table window.
f. Click OK to save the changes and close the Data Source Properties window.
7. Edit the Basic Settings for the Data Source node.
a. If necessary, click View  Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Data Source as the value for the Description field.
8. Add an Address Verification (US/Canada) node to the job flow.
a. In the Nodes resource pane, collapse the Data Inputs grouping of nodes.
b. Expand the Enrichment grouping of nodes.
c. Double-click the Address Verification (US/Canada) node.
The node appears in the data flow. The Address Verification (US/Canada) Properties window
appears.
9. Specify properties for the Address Verification (US/Canada) node.
a. Type Address Verification in the Name field.
b. Verify that United States is selected in the Address area.
c. Specify the Input information.
1) Click the down arrow under Field Type for the ADDRESS field and click Address Line 1.
2) Click the down arrow under Field Type for the CITY field and click City.
3) Click the down arrow under Field Type for the STATE/PROVINCE field and click State.
4) Click the down arrow under Field Type for the ZIP/POSTAL CODE field and click Zip.
d. Specify options for the address verification.
1) Click Options.
2) Click Proper case results.
3) Click Street abbreviation.
4) Click City abbreviation.
5) Clear CASS compliance.
6) Verify that Insert dash between ZIP and ZIP4 is selected.
7) Click OK to save the settings and close the Options window.
e. Specify the output fields information.
1) Double-click Address Line 1 to move this field from the Available list box to the Selected
list box.
2) Double-click City to move this field from the Available list box to the Selected list box.
3) Double-click State to move this field from the Available list box to the Selected list box.
4) Double-click ZIP/Postal_Code to move this field from the Available list box to the Selected
list box.
5) Double-click US_ZIP to move this field from the Available list box to the Selected list box.
6) Double-click US_ZIP4 to move this field from the Available list box to the Selected list box.
7) Double-click US_Result_Code to move this field from the Available list box to the Selected
list box.
8) Double-click US_Numeric_Result_Code to move this field from the Available list box to the
Selected list box.
9) For the Output Name field, type _Verified after Address_Line_1.
10) For the Output Name field, type _Verified after City.
11) For the Output Name field, type _Verified after State.
12) For the Output Name field, type _Verified after ZIP/Postal_Code.
The final set of selected output fields should resemble the following:
f. Limit the original fields in output.
1) Click Additional Outputs.
2) Click the EMAIL field in the Output fields area, and then click the delete button.
3) Click the JOB TITLE field in the Output fields area, and then click the delete button.
4) Click the BUSINESS PHONE field in the Output fields area, and then click the delete button.
5) Click the HOME PHONE field in the Output fields area, and then click the delete button.
6) Click the MOBILE PHONE field in the Output fields area, and then click the delete button.
7) Click the FAX NUMBER field in the Output fields area, and then click the delete button.
8) Click the COUNTRY/REGION field in the Output fields area, and then click the delete button.
9) Click the NOTES field in the Output fields area, and then click the delete button.
The final set of additional output fields should resemble the following:
10) Click OK to close the Additional Outputs window.
g. Click OK to close the Address Verification (US/Canada) Properties window.
10. Establish an appropriate description and preview the Address Verification (US/Canada) node.
a. If necessary, click the Address Verification node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Verify Addresses as the value for the Description field.
e. Click the Address Verification node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to view the results.
Notice the following:
• The street types in the original Address field are abbreviated in the verified Address field.
• Some verified city values have been updated to a new value that corresponds to the zip code
information.
• Some city “words” in the original City field are abbreviated in the verified city field.
• Some State field values that do not follow the pattern AA are verified as state values with the pattern AA.
• Most records in the preview sample have a U.S. result code of OK and a U.S. numeric result code of 0.
The US_Result_Code field indicates whether the address was successfully verified. If the address was not successfully verified, the return code indicates the cause of the failure. The US_Numeric_Result_Code field is the numeric equivalent of the US_Result_Code field. Possible values for both fields are as follows:
Text Result Code   Numeric Result Code   Description
OK                 0                     Address was verified successfully.
PARSE              11                    Error parsing address. Components of the address might be missing.
CITY               12                    Could not locate city, state, or ZIP code in the USPS database. At least city and state or ZIP code must be present in the input.
MULTI              13                    Ambiguous address. There are two or more possible matches for this address with different data.
NOMATCH            14                    No matching address is found in the USPS data.
OVER               15                    One or more input strings is too long (maximum 100 characters).
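A downstream step could branch on these codes. The sketch below is an assumption about how one might route records after verification, not a DataFlux node; the field name follows the demonstration's output.

```python
# Codes from the table above that warrant manual review.
REVIEW_CODES = {"PARSE", "CITY", "MULTI", "NOMATCH", "OVER"}

def route_record(record):
    """Send cleanly verified records on; queue everything else for review."""
    code = record.get("US_Result_Code", "")
    if code == "OK":
        return "verified"
    if code in REVIEW_CODES:
        return "needs_review"
    return "unknown_code"
```

For example, a record with US_Result_Code MULTI (an ambiguous address) would be routed to the review queue rather than written to the clean output.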
11. Add a Geocoding node to the job flow.
a. If necessary, expand the Enrichment grouping of nodes.
b. Double-click the Geocoding node.
The node appears in the data flow. The Geocoding Properties window appears.
12. Specify the properties for the Geocoding node.
a. Type Geocoding using Zip Verified field in the Name field.
b. In the Input area, click the down arrow under Field Type for the ZIP/Postal_Code_Verified field. Click ZIP+4 (US Only).
c. In the Output fields area, click the double right arrow to move all available fields to the Selected list box.
d. Click OK to save the settings and close the Geocoding Properties window.
13. Establish an appropriate description and preview the Geocoding node.
a. If necessary, click the Geocoding node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Geocoding as the value for the Description field.
e. Click the Geocoding node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to view the results.
Notice the following:
• The Geocode_Result_Code field has mainly OK as a returned value.
• Records where the verified ZIP code was only five digits have a Geocode_Result_Code field value of OK5DIGIT. These records have only Geocode_Latitude and Geocode_Longitude as additional returned information. The remaining output fields are blank.
The field definitions are shown below:

Available Field Name    Description
Geocode_Result_Code     Indicates whether the record was successfully geocoded. Other possible codes are as follows:
                        • DP - The match is based on the delivery point.
                        • PLUS4 - The match failed on the delivery point, so the match is based on ZIP+4.
                        • ZIP - The ZIP+4 match failed, so the match is based on the ZIP code.
                        • NOMATCH - The first three checks failed, so there is no match in the geocoding database.
Geocode_Latitude        The numerical horizontal map reference for address data.
Geocode_Longitude       The vertical map reference for address data.
Geocode_Census_Tract    A U.S. Census Bureau reference number assigned using the centroid latitude and longitude. This number contains references to the State and County codes.
Geocode_FIPS            The Federal Information Processing Standards (FIPS) number that the U.S. Census Bureau uses to refer to certain data.
Geocode_Census_Block    The last four digits of the State/County/Tract/Block value.
14. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, collapse the Enrichment grouping of nodes.
b. Expand the Data Outputs grouping of nodes.
c. Double-click the Text File Output node.
An instance of the node appears in the data flow. The Text File Output Properties window
appears.
15. Specify properties for the Text File Output node.
a. Type Customer Info – Verify/Geocode in the Name field.
b. Specify the output file information.
1) Click the browse button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D6_Customer_Verify_Geocode.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Accept all fields as selected fields for the output file.
e. Export the field layout.
1) Click Export.
2) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D6_Customer_Verify_Geocode.dfl in the File name field.
4) Click Save.
f. Click OK to save the changes and close the Text File Output Properties window.
16. Establish an appropriate description for the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output Text File as the value for the Description field.
The data job should resemble the following:
17. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
18. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
c. Verify that the text file appears.
d. Select File  Exit to close the Notepad window.
e. Notice the processing information on each node.
19. Select File  Close.
Exercises
3. Performing Address Verification on the MANUFACTURERS Table
Create a data job that works with the Address Verification node. The final job flow should resemble
the following:
• Create a data job (Ch4E3_MANUFACTURERS_Verify) in the batch_jobs folder of the Basics Exercises repository.
• Use the MANUFACTURERS table from dfConglomerate Grocery as the data source.
• Add an Address Verification (US/Canada) node to the job flow. Use the following specifications:
− Map the following input fields:
  Field Name      Field Type
  MANUFACTURER    Firm
  STREET_ADDR     Address Line 1
  CITY            City
  STATE_PROV      State
  POSTAL_CD       Zip
Specify the following options:
 Proper case results
 Output blanks as nulls
 Street abbreviation
 City abbreviation
 Insert dash between ZIP when ZIP4 is selected
− Specify the following output fields:
  Output Type        Output Name
  Firm               Firm_Verified
  Address Line 1     Address_Verified
  City               City_Verified
  State              State_Verified
  ZIP/Postal Code    ZIP_Verified
  US County Name     US_County_Name
  US Result Code     US_Result_Code
• Retain only the following original fields for additional outputs:
  ID
  MANUFACTURER
  STREET_ADDR
  CITY
  STATE/PROV
  POSTAL_CD
  COUNTRY
  PHONE
• Add a Text File Output node to the job flow. Use the following specifications:
Output file: S:\Workshop\dqdmp1\Exercises\files\OutputFiles\Ch4E3_Manufacturers_Verify.txt
Text qualifier: “ (double quotation mark)
Field delimiter: Comma
 Include header row
 Display the file after job runs
• Specify the following fields for the text file with the specified output names:
  Field Name        Output Name
  ID                ID
  Firm_Verified     Manufacturer
  Address_Verified  Address
  City_Verified     City
  State_Verified    State
  ZIP_Verified      ZIP
  US_County_Name    US_County_Name
  US_Result_Code    US_Result_Code
Save and run the job. Verify that the text file is created properly.
4.4 Entity Resolution Jobs
Objectives
 Define match codes.
 Describe cluster conditions.
Match Codes
At a sensitivity of 85, the names John Q Smith, Johnny Smith, and Jonathon Smythe all generate the same match code.
Match code generation process:
1. Data is parsed into its components (for example, Given Name and
Family Name).
2. Ambiguities and noise words are removed (for example, the).
3. Transformations are made (for example, Jonathon > Jon).
4. Phonetics are applied (for example, PH > F).
5. Based on the sensitivity selection, the following occurs:
 Relevant components are determined.
 Only a certain number of characters of the transformed relevant
components are used.
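The five steps above can be sketched in Python. This is a simplified illustration only: the noise-word list, nickname table, phonetic rules, and the sensitivity-to-length mapping are all invented for this sketch, and DataFlux's actual match definitions are proprietary.

```python
# Illustrative sketch of match code generation. All rules below are assumptions,
# not DataFlux's real algorithms.
NOISE_WORDS = {"the", "of", "and"}
NICKNAMES = {"johnny": "john", "jonathon": "jon"}          # step 3: transformations
PHONETICS = [("ph", "f"), ("th", "t"), ("y", "i"), ("e", "")]  # step 4: sound-alikes

def match_code(name: str, sensitivity: int = 85) -> str:
    # Step 1: parse into components (given name, family name).
    tokens = [t.lower() for t in name.split() if t.lower() not in NOISE_WORDS]
    given, family = tokens[0], tokens[-1]
    # Steps 2-3: remove ambiguities and standardize nicknames.
    given = NICKNAMES.get(given, given)
    # Step 4: apply crude phonetic reductions.
    def phonetic(s: str) -> str:
        for a, b in PHONETICS:
            s = s.replace(a, b)
        return s
    given, family = phonetic(given), phonetic(family)
    # Step 5: sensitivity controls how many characters are kept (assumed rule).
    keep = max(1, sensitivity // 25)
    code = (family[:keep] + given[:1]).upper()
    return code.ljust(10, "$")   # pad with $, as DataFlux match codes appear

for n in ["John Q Smith", "Johnny Smith", "Jonathon Smythe"]:
    print(n, "->", match_code(n))   # all three print the same code: SMIJ$$$$$$
```

Despite three different spellings, the nickname and phonetic reductions collapse all three names to the same code, which is the behavior the slide illustrates.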
DataFlux combines the strengths of probabilistic and deterministic matching. Like deterministic matching, it provides dichotomous
(true/false) results, and it also takes advantage of the flexible scoring of the probabilistic matching
technique. With a single pass through the data, the DataFlux matching engine produces an unambiguous
match code that represents the identifying variables that were specified by the user. The key to the
DataFlux technology is this match code.
• Prior to match code generation, the matching engine parses the data into its components and implicitly
removes all ambiguities in the system. Using industry-leading data quality algorithms, it removes nonessential matching information and standardizes the data.
• The matching engine enables near matching by enabling you to specify a match threshold (sensitivity)
for each identifying variable. Match codes are generated in direct correlation to the desired sensitivity.
If the sensitivity is higher, the matching requirements are more stringent and the match code is more
specific. This enables you to specify high matching requirements for some variables and lower
requirements for others.
• The matching engine enables the specification of configurable business rules that impact the match.
The rules are implicitly applied before generating the match code. This enables you to make specific
corrections in the data before performing any linkage. Changing the matching logic requires the use
of the Data Management Studio – Customize module.
• Regardless of the data source, the matching engine creates the same match code as long as the same
match definition and sensitivity setting are selected when the match code is generated. This enables
matching across multiple data sources without constantly reprocessing and re-indexing the data, which
is something that probabilistic matching systems cannot do.
4-92
Chapter 4 ACT
Clustering (Grouping Records)
Grouping on the fields:
 Name_MC_85 and Address_MC_85
(matches records 1 and 2)
 Name_MC_85 and Phone
(matches records 2 and 3)
Grouping on both conditions:
 Name_MC_85 and Address_MC_85
or
 Name_MC_85 and Phone
(causes all 3 records to match)
The Clustering node provides the ability to match records based on multiple conditions. Create the
conditions that support your business needs.
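The clustering logic in the slide above can be sketched with a union-find structure. This is an assumed implementation, not DataFlux's: two records join a cluster when they agree on all fields of any one condition, and clusters chain transitively, exactly as in the three-record example.

```python
# Sketch of multi-condition clustering (assumed logic). Records that agree on ALL
# fields of ANY one condition are linked; links chain transitively into clusters.
from itertools import combinations

def cluster(records, conditions):
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for i, j in combinations(range(len(records)), 2):
        for cond in conditions:                                 # OR across conditions
            if all(records[i][f] == records[j][f] for f in cond):  # AND within one
                union(i, j)
                break
    roots = {}
    return [roots.setdefault(find(i), len(roots) + 1) for i in range(len(records))]

recs = [
    {"Name_MC_85": "A", "Address_MC_85": "X", "Phone": "111"},
    {"Name_MC_85": "A", "Address_MC_85": "X", "Phone": "222"},
    {"Name_MC_85": "A", "Address_MC_85": "Y", "Phone": "222"},
]
# Condition 1 links records 1-2, condition 2 links records 2-3; together all three cluster.
print(cluster(recs, [("Name_MC_85", "Address_MC_85"),
                     ("Name_MC_85", "Phone")]))   # -> [1, 1, 1]
```

With only the first condition supplied, record 3 would remain in its own cluster, mirroring the "grouping on the fields" versus "grouping on both conditions" contrast in the slide.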
Creating a Data Job to Cluster Records
In this demonstration, you will generate clusters of data based on three conditions. The three conditions
will take advantage of match codes generated on various fields, as well as standardized field information.
A match report will display the results of the clustering.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Select New  Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D7_ClusterRecords_Example in the Name field.
c. Click OK.
The new data job appears on a tab.
6. Add the Data Source node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. Click the expand icon in front of the Data Inputs grouping of nodes.
c. Double-click the Data Source node.
An instance of the node appears in the data flow. The properties window for the node appears.
d. Type Customers Table in the Name field.
e. Click the browse button next to the Input table field.
1) Click the expand icon in front of the dfConglomerate Gifts data source.
2) Click Customers.
3) Click OK to close the Select Table window.
f. Click OK to save the changes and close the Data Source Properties window.
7. Establish an appropriate description for the Data Source node.
a. If necessary, click the Data Source node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Data Source as the value for the Description field.
8. Add a Match Codes (Parsed) node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
b. Click the expand icon in front of the Entity Resolution grouping of nodes.
c. Double-click the Match Codes (Parsed) node.
The node appears in the data flow. The Match Codes (Parsed) Properties window appears.

The (Parsed) nodes are used when the input data is parsed. Because the Name Match
definition is designed for a full name, if you only generate a match code on
FIRST_NAME, in most cases the definition would assume this was a last name. Thus,
the matching of nicknames would not happen. For example, Jon would not match John.
9. Specify properties for the Match Codes (Parsed) node.
a. Type Match Codes for Parsed Name in the Name field.
b. Type Name_MatchCode in the Output field field.
c. Click the down arrow under Sensitivity and select 85.
d. Click the down arrow under Definition and select Name.
e. Click the down arrow under Field Name for the Given Name token and click FIRST NAME.
f. Click the down arrow under Field Name for the Family Name token and click LAST NAME.
g. Click Generate null match codes for blank field values.
h. Click Preserve null values.
The final settings for the Match Codes (Parsed) node should resemble the following:
i. Click OK to close the Match Codes (Parsed) Properties window.

The Allow generation of multiple match codes per definition option requires the creation
of a special match definition in the Data Management Studio – Customize module.

The Generate null match codes for blank field values option generates a NULL match
code if the field is blank. If this option is not selected, then a match code of all $ symbols
is generated for the field. When you match records, a field with NULL does not equal
another field with NULL. A field with all $ symbols equals another field with all $
symbols.
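The note above can be illustrated with a small sketch. The comparison semantics are assumed from the note: a NULL match code never equals another NULL, while an all-$ code does equal another all-$ code, so blank fields would still match each other.

```python
# Assumed comparison semantics from the note: NULL never equals NULL, but an
# all-$ match code equals another all-$ match code.
def codes_match(a, b):
    if a is None or b is None:       # NULL match codes never join records
        return False
    return a == b

blank_as_null = None          # with "Generate null match codes for blank field values"
blank_as_dollars = "$" * 10   # without the option: an all-$ match code

print(codes_match(blank_as_null, blank_as_null))   # False: blank fields do not match
print(codes_match(blank_as_dollars, "$" * 10))     # True: blank fields match each other
```

This is why the option matters for clustering: without it, records with blank names can cluster together solely because both names were blank.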
10. Establish an appropriate description and preview the Match Codes (Parsed) node.
a. If necessary, click the Match Codes (Parsed) node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Parsed Input as the value for the Description field.
e. Click the Match Code (Parsed) node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the Name_MatchCode field.
11. Add a Match Codes node to the job flow.
a. Verify that the Entity Resolution grouping of nodes is expanded.
b. Double-click the Match Codes node.
The node appears in the data flow. The Match Codes Properties window appears.
12. Specify properties for the Match Codes node.
a. Type Match Codes for Various Fields in the Name field.
b. Double-click the COMPANY field to move it from the Available list box to the Selected list box.
c. Click the down arrow under Definition and select Organization.
d. Accept the default Sensitivity of 85.
e. Double-click the ADDRESS field to move it from the Available list box to the Selected list box.
f. Click the down arrow under Definition and select Address.
g. Accept the default Sensitivity of 85.
h. Double-click the CITY field to move it from the Available list box to the Selected list box.
i. Click the down arrow under Definition and select City.
j. Accept the default Sensitivity of 85.
k. Double-click the STATE/PROVINCE field to move it from the Available list box to the Selected
list box.
l. Click the down arrow under Definition and select State/Province.
m. Accept the default Sensitivity of 85.
n. Double-click the ZIP/POSTAL CODE field to move it from the Available list box to the
Selected list box.
o. Click the down arrow under Definition and select Postal Code.
p. Accept the default Sensitivity of 85.
q. Click Generate null match codes for blank field values.
r. Click Preserve null values.
The final settings for the Match Codes Properties window should resemble the following:
s. Click OK to close the Match Codes Properties window.
13. Establish an appropriate description and preview the Match Codes node.
a. If necessary, click the Match Codes node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Match Codes as the value for the Description field.
e. Click the Match Codes node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the _MatchCode fields.

Generating a match code performs an out-of-the-box standardization behind the scenes.
Unless you want to write the standardized values or you want to perform custom
standardizations by using a scheme or a modified definition, you do not need to standardize
before generating match codes.
14. Add a Standardization node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
b. Click the expand icon in front of the Quality grouping of nodes.
c. Double-click the Standardization node.
The node appears in the data flow. The Standardization Properties window appears.
15. Specify properties for the Standardization node.
a. Type Standardize Phone in the Name field.
b. Double-click BUSINESS PHONE to move it from the Available list box to the Selected list box.
c. For the BUSINESS PHONE field, click the down arrow under Definition and select Phone.
d. Click Preserve null values.
The Standardization Properties window should resemble the following:
e. Click OK to close the Standardization Properties window.
16. Establish an appropriate description for the Standardization node.
a. If necessary, click the Standardization node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Standardization as the value for the Description field.
17. Add a Clustering node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Quality grouping of nodes.
b. Click the expand icon in front of the Entity Resolution grouping of nodes.
c. Double-click the Clustering node.
The node appears in the data flow. The Clustering Properties window appears.
18. Specify properties for the Clustering node.
a. Type Cluster on Three Conditions in the Name field.
b. Type Cluster_ID in the Output cluster ID field field.
c. Click Options.
1) Click Condition matched field prefix.
2) Type COND_ in the Condition matched field prefix field.
3) Click OK.
d. Click Sort output by cluster number.
e. Verify that All clusters is selected in the Cluster area.
The clustering properties at this point should resemble the following:
f. Specify the first condition.
1) Double-click Name_MatchCode to move from Available fields to Selected fields.
2) Double-click COMPANY_MatchCode to move from Available fields to Selected fields.
3) Double-click ADDRESS_MatchCode to move from Available fields to Selected fields.
4) Double-click CITY_MatchCode to move from Available fields to Selected fields.
5) Double-click STATE/PROVINCE_MatchCode to move from Available fields to Selected
fields.
g. Specify the second condition.
1) Click OR.
2) Double-click Name_MatchCode to move from Available fields to Selected fields.
3) Double-click COMPANY_MatchCode to move from Available fields to Selected fields.
4) Double-click ADDRESS_MatchCode to move from Available fields to Selected fields.
5) Double-click ZIP/POSTAL CODE_MatchCode to move from Available fields to Selected
fields.
h. Specify the third condition.
1) Click OR.
2) Double-click Name_MatchCode to move from Available fields to Selected fields.
3) Double-click COMPANY_MatchCode to move from Available fields to Selected fields.
4) Double-click BUSINESS PHONE_MatchCode to move from Available fields to Selected
fields.
The final set of conditions should resemble the following:
i. Click OK to close the Clustering Properties window.
19. Establish an appropriate description and preview the Clustering node.
a. If necessary, click the Clustering node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Clustering as the value for the Description field.
e. Click the Clustering node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the Cluster_ID and COND_ fields.
h. Scroll down through the observations.
Notice that some records have duplicate values for Cluster_ID.
The COND_ fields display true if that condition (1, 2, or 3) is met to form the cluster. Otherwise,
the COND_ fields display false.
20. Add a Match Report node to the job flow.
a. In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Match Report node.
The node appears in the data flow. The Match Report Properties window appears.
21. Specify properties for the Match Report node.
a. Type Customers Match Report in the Name field.
b. Click the browse button next to the Report file field.
1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Type Ch4D7_Customers_MatchReport in the File name field.
3) Click Save.
c. Type Customers Match Report – Three Conditions in the Report Title field.
d. Click Launch Viewer after job is completed.
e. Click the down arrow next to the Cluster field and select Cluster_ID.
At this point, the Match Report Properties window should resemble the following:
f. Specify report fields.
1) Click the double left arrow to move all fields from the Selected list box.
2) Double-click Cluster_ID to move this field from the Available list box to the Selected
list box.
3) Double-click ID to move this field from the Available list box to the Selected list box.
4) Double-click COMPANY to move this field from the Available list box to the Selected
list box.
5) Double-click LAST NAME to move this field from the Available list box to the Selected
list box.
6) Double-click FIRST NAME to move this field from the Available list box to the Selected
list box.
7) Double-click ADDRESS to move this field from the Available list box to the Selected
list box.
8) Double-click CITY to move this field from the Available list box to the Selected list box.
9) Double-click STATE/PROVINCE to move this field from the Available list box to the
Selected list box.
10) Double-click ZIP/POSTAL CODE to move this field from the Available list box to the
Selected list box.
11) Double-click BUSINESS PHONE_Stnd to move this field from the Available list box to the
Selected list box.
The final set of selected fields should resemble the following:
g. Click OK to close the Match Report Properties window.
22. Establish an appropriate description for the Match Report node.
a. If necessary, click the Match Report node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Match Report as the value for the Description field.
The final job flow should resemble the following:
 The above diagram was redrawn to facilitate viewing the display.
23. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
24. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
c. Verify that the dfReport Viewer appears.
Because you requested all clusters, you get many “clusters” with one record. Use the paging tools in the viewer to scroll through the pages of clusters.
QUESTION: What is the largest number of records in a cluster?
Answer: Three (3)
d. Select File  Exit to close the dfReport Viewer window.
Notice the processing information on each node.
25. View the log information.
a. Click the Log tab.
b. Review the information for each of the nodes.
26. Click the Data Flow tab.
27. Select File  Close.
Exercises
4. Creating a Data Job that Generates a Match Report
Create a data job that generates match codes, clusters the records, and produces a match report. The final job flow should resemble the following:
• Create a new data job named Ch4E4_Manufacturers_MatchReport in the batch_jobs folder of the Basics Exercises repository.
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data source. Select the following fields:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
• Use the specified definitions and sensitivities. Accept the default names for the match code fields. Be sure to generate null match codes for blank field values and to preserve null values. Generate match codes for the following fields:
Field Name            Definition        Sensitivity
MANUFACTURER          Organization      75
CONTACT               Name              75
CONTACT_ADDRESS       Address           85
CONTACT_STATE_PROV    State/Province    85
CONTACT_POSTAL_CD     Postal Code       85
CONTACT_PHONE         Phone             95
• Cluster similar records using the following two conditions. Create a cluster field named Cluster_ID. The output must contain all clusters and be sorted by cluster number.
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_ADDRESS_MatchCode
CONTACT_POSTAL_CD_MatchCode
OR
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_STATE_PROV_MatchCode
CONTACT_PHONE_MatchCode
• Create a match report to display the cluster groupings. The match report should be opened automatically after the job is run. Name the match report Ch4E4_Manufacturers_MatchReport.mre and store it in the directory S:\Workshop\dqdmp1\Exercises\files\output_files. Display the following fields for the report:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
• Save and run the data job. Review the resulting match report and log.
Surviving Record Identification – Record Rules
Defined Record-level Rules:
1. FIRST NAME = Highest Occurrence
2. ID = Maximum Value
Record-level rules select which record from a grouping should survive. If there is ambiguity about which
record is the survivor, the tool selects the first remaining record in the grouping.
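The two record-level rules above can be sketched in Python. The behavior is inferred from the slide (rules narrow the cluster in order; if more than one record remains, the first one survives), and the sample records and helper names are invented for this sketch.

```python
# Sketch of record-rule survivorship (assumed logic, invented sample data).
from collections import Counter

cluster_records = [
    {"ID": 3, "FIRST NAME": "Jon"},
    {"ID": 7, "FIRST NAME": "John"},
    {"ID": 9, "FIRST NAME": "John"},
]

def highest_occurrence(records, field):
    # Keep the records whose value is the most common in the cluster.
    top, _ = Counter(r[field] for r in records).most_common(1)[0]
    return [r for r in records if r[field] == top]

def maximum_value(records, field):
    # Keep the records with the largest value for the field.
    top = max(r[field] for r in records)
    return [r for r in records if r[field] == top]

def survivor(records, rules):
    remaining = records
    for field, rule in rules:           # rules are applied in order
        remaining = rule(remaining, field) or remaining
    return remaining[0]                 # ambiguity: first remaining record wins

best = survivor(cluster_records,
                [("FIRST NAME", highest_occurrence), ("ID", maximum_value)])
print(best["ID"])   # rule 1 keeps the two "John" records; rule 2 keeps ID 9
```

Rule 1 eliminates the lone "Jon" record; rule 2 then breaks the tie by maximum ID, so the record with ID 9 survives.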
Surviving Record Identification – Field Rules
Defined Field-level Rules:
1. Where EMAIL is Not Null, Select EMAIL
2. Where STATE is Shortest Value, Select CITY
and STATE
Field rules are used to “steal” information from other records in the grouping.
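A field rule of the "Where EMAIL is Not Null, Select EMAIL" form could be sketched as follows. This is a hypothetical illustration: the function name, sample data, and fill-from-first-non-null behavior are all assumptions, not DataFlux's implementation.

```python
# Sketch of a field rule (assumed logic): the surviving record "steals" the first
# non-null EMAIL found anywhere in its cluster.
def apply_field_rule(survivor, cluster, field):
    if survivor.get(field) is None:
        for rec in cluster:
            if rec.get(field) is not None:
                survivor[field] = rec[field]   # copy the value from a duplicate record
                break
    return survivor

cluster_records = [
    {"ID": 1, "EMAIL": None},
    {"ID": 2, "EMAIL": "jq@example.com"},
]
winner = dict(cluster_records[0])               # suppose record 1 survived
print(apply_field_rule(winner, cluster_records, "EMAIL")["EMAIL"])
```

Even though the surviving record had no e-mail address of its own, the field rule pulls the value from the duplicate record before the duplicate is discarded.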
Entity Resolution File
Entity resolution is the process of merging duplicate records in a single file or multiple files so that
records referring to the same physical object are treated as a single record. Records are matched based on
the information that they have in common. The records that are merged might appear to be different but
can actually refer to the same person or item. For example, a record for John Q. Smith at 220 Academy
Street might be the same person as J. Q. Smith at the same address.
The Entity Resolution File enables you to manually review the merged records and make adjustments as
necessary. This can involve the following tasks:
• Examining clusters
• Reviewing the Cluster Analysis section
• Reviewing related clusters
• Processing cluster records
Creating an Entity Resolution Job
In this demonstration, you investigate various settings for the Surviving Record Identification node.
In addition, you explore the Entity Resolution File Output node.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Copy an existing job to use as a starting point.
a. Click the expand icon in front of the Basics Demos repository.
b. Click the expand icon in front of the batch_jobs folder.
c. Right-click the Ch4D7_ClusterRecords_Example job and select Copy.
d. Right-click the batch_jobs folder and select Paste.
A message window appears.
e. Click Rename the new item (Copy of …, Copy (2) of …, etc.).
f. Click OK.
5. Rename the copied job.
a. Right-click Copy of Ch4D7_ClusterRecords_Example and select Rename.
b. Type Ch4D8_SRI_Example.
c. Press ENTER.
6. Double-click Ch4D8_SRI_Example.
The data job opens on a new tab.
7. Right-click the match report node (labeled Customers Match Report) and select Delete.
8. Add a Surviving Record Identification node to the job flow.
a. In the Nodes resource pane, click the expand icon in front of the Entity Resolution grouping of nodes.
b. Double-click the Surviving Record Identification node.
The node appears in the data flow. The Surviving Record Identification Properties window
appears.
9. Specify properties for the Surviving Record Identification node.
a. Type Select Best Record in the Name field.
b. Click the down arrow next to the Cluster ID field and select Cluster_ID.
c. Add one record rule.
1) Click Add in the Record rules area.
2) Click the down arrow next to Field and select ID.
3) Click the down arrow in the Operation field and select Maximum Value.
4) Click Add Condition. The Expression area is updated to the following:
5) Click OK to close the Add Record Rule Expression window.
d. Click OK to close the Surviving Record Identification Properties window.
10. Establish an appropriate description for the Surviving Record Identification node.
a. If necessary, click the Select Best Record node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Surv Rec Id as the value for the Description field.
11. Add a (temporary) Text File Output node to demonstrate surviving record options.
a. In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
b. Click the expand icon in front of the Data Outputs grouping of nodes.
c. Double-click the Text File Output node.
An instance of the node appears in the data flow. The Text File Output Properties window appears.
12. Specify properties for the Text File Output node.
a. Type Test in the Name field.
b. Specify the output file information.
1) Click the browse button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Test.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
d. Move the Cluster_ID field so that it follows the ID field.
1) Scroll in the Selected list box and locate the Cluster_ID field.
2) Click the up arrow until Cluster_ID follows the ID field.
e. Click OK to save the changes and close the Text File Output Properties window.
13. Establish an appropriate description for the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Text File Output as the value for the Description field.
14. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
15. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The processing information appears on each of the nodes.
Notice that the Surviving Record Identification node read 63 rows. For each multi-row cluster, only
one record was selected (the record with the maximum ID value). Therefore, the number of records
written to the text file is 57 rows. Selecting only one record from each cluster is the default action.
16. Edit the properties of the Surviving Record Identification node.
a. Right-click on the Select Best Record node and select Properties.
b. Click Options to the right of the Cluster ID field.
1) Click Keep duplicate records.
2) Type SR_Flag in the Surviving record ID field field.
3) Click OK to close the Options window.
c. Click OK to close the Surviving Record Identification Properties window.
17. Edit the properties of the Text File Output node.
a. Right-click the Text File Output node and select Properties.
b. Click Display file after job runs.
c. Locate the SR_Flag field in the Available list box under the Output fields area.
d. Click the right arrow to move the SR_Flag field to the Selected list box.
e. Click the SR_Flag field in the Selected list box.
f. Click the up arrow to move the SR_Flag field up in the list box to follow the Cluster_ID field.
g. Click OK to close the Text File Output Properties window.
18. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
19. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The text file appears.
c. Select File  Exit to close the Notepad window.
The processing information appears on each of the nodes.
Notice that the text file output node wrote the entire 63 records. The text file shows that the surviving
record in a cluster has true in the SR_Flag field and the duplicate records (non-surviving) have false
in the SR_Flag field.
20. Edit the properties of the Surviving Record Identification node.
a. Right-click the Select Best Record node and select Properties.
b. Click Options to the right of the Cluster ID field.
1) Click Generate distinct surviving record.
2) Click the down arrow next to the Primary key field and select ID.
3) Click OK to close the Options window.
c. Click OK to close the Surviving Record Identification Properties window.
21. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
22. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The text file appears.
c. Select File  Exit to close the Notepad window.
The processing information appears on each of the nodes.
There is now an extra record for each cluster. The text file shows that the surviving record in a
cluster has true in the SR_Flag field and the duplicate records (non-surviving) have false in the
SR_Flag field. If a field (such as ID in this example) is not chosen as the Primary key field, then
the ID retains its value across all records in a cluster.
Thus, an Options specification such as the one shown below:
… produces output such as the following:
23. Edit the properties of the Surviving Record Identification node.
a. Right-click on the Select Best Record node and select Properties.
b. Click Options to the right of the Cluster ID field.
1) Click Use primary key as surviving record ID.
2) Click the down arrow next to the Primary key field and select ID.
3) Click OK to close the Options window.
c. Click OK to close the Surviving Record Identification Properties window.
24. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
25. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The text file appears.
c. Select File  Exit to close the Notepad window.
The processing information appears on each of the nodes.
The text file shows that the surviving record in a cluster has a null value in the SR_Flag field.
The duplicate records (non-surviving) have the value of the primary key value of the surviving
record in the SR_Flag field.
26. Update the properties for the Surviving Record Identification node.
a. Right-click the Surviving Record Identification node and select Properties.
b. Click Field Rules in the Output fields area on the lower right.
c. Add the first of two field rules.
1) Click Add in the Field Rules window.
2) Click Add in the Rule expressions area of the Add Field rule window.
3) Click the down arrow next to Field and select EMAIL.
4) Click the down arrow in the Operation field and select Is Not Null.
5) Click Add Condition.
The Expression area is updated to the following:
6) Click OK to close the Add Field Rule Expression window.
7) Click OK to close the Add Field Rule window.
d. Add the second of two field rules.
1) Click Add in the Field Rules window.
2) Click Add in the Rule expressions area of the Add Field rule window.
3) Click the down arrow next to Field and select JOB TITLE.
4) Click the down arrow in the Operation field and select Is Not Null.
5) Click Add Condition.
6) Click AND.
7) Click the down arrow next to Field and select JOB TITLE.
8) Click the down arrow in the Operation field and select Longest value.
9) Click Add Condition.
10) Click OK to close the Add Field Rule Expression window.
11) In the Affected fields area, double-click the NOTES field to move it from the Available list
box to the Selected list box.
12) Click OK to close the Add Field Rule window.
13) Click OK to close the Field Rules window.
e. Click OK to close the Surviving Record Identification Properties window.
27. Edit the properties of the Text File Output node.
a. Right-click the Text File Output node and select Properties.
b. Click the EMAIL field in the Selected list box.
c. Click the up arrow to move the EMAIL field up in the list box to follow the SR_Flag field.
d. Click the JOB TITLE field in the Selected list box.
e. Click the up arrow to move the JOB TITLE field up in the list box to follow the EMAIL field.
f. Click OK to close the Text File Output Properties window.
28. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
29. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The text file appears.
Without the field rules, the output would be the following:
With the field rules in place, important information that is potentially spread across multiple records
of a cluster can be retrieved for the surviving record. In this case, with the field rules in place, the
surviving record with ID=63 has an e-mail address as well as a job title.
30. Right-click the Text File Output node (labeled Test) and select Delete.
31. Select Edit  Clear Run Results  Clear All Run Results.
32. Add an Entity Resolution File Output node to the job flow.
a. If necessary, click the expand icon in front of the Data Outputs grouping of nodes.
b. Double-click the Entity Resolution File Output node.
The node appears in the data flow. The Entity Resolution File Output Properties window appears.
33. Specify properties for the Entity Resolution File Output node.
a. Type ER File in the Name field.
b. Click the down arrow next to the Cluster ID field and select Cluster_ID.
c. Verify that the Source table field has a value of Customers.
d. Click the browse button next to the Output file field.
1) Type Ch4D8_EntityResolutionResults in the Name field.
2) Navigate to the Basics Demos  output_files folder.
3) Click Save.
e. Click Display file after job runs.
f. In the Primary keys area, double-click ID from the Available fields list box to move it to the Selected fields list box.
g. Click Options.
1) Click Display single-record clusters in the output file.
2) Click the down arrow next to the Surviving record ID field and select SR_Flag.
3) Click Surviving record ID field contains primary key value.
4) Click OK to close the Options window.
h. Click Edit Field Settings.
1) Double-click EMAIL to move it from the Available fields list box to the Selected fields list
box.
2) Double-click JOB TITLE to move it from the Available fields list box to the Selected fields
list box.
3) Double-click NOTES to move it from the Available fields list box to the Selected fields list
box.
4) Click OK to close the Edit Field Settings window.
 You want to select the fields affected by the field rules.
i. Click Target Settings.
1) Click Flat file.
a) Click the browse button next to the File name field.
(1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
(2) Type Ch4D8_Demo_ERF_Target.txt in the File name field.
(3) Click Save.
b) Verify that Comma is the value for the Delimiter field.
c) Select “ (double quotation mark) as the value for the Text field.
2) Click Flag duplicate records for deletion.
3) Click the down arrow under the Delete Flag field and select SR_Flag.
4) Click the browse button next to the Audit file name field.
a) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
b) Type Ch4D8_Demo_ERF_Audit.txt in the File name field.
c) Click Save.
5) Click OK to close the Target Settings window.
j. Click OK to close the Entity Resolution File Output Properties window.
34. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
35. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
c. Verify that the Ch4D8_EntityResolutionResults tab appears.
36. Review the entity resolution file. Use the Entity Resolution Viewer.
a. Verify that the Cluster tab is selected.
A list of the clusters contained in the entity resolution file is displayed.
b. If necessary, click the last tool on the toolbar to reveal the Cluster Analysis pane.
c. Move the cursor over the bar where the record count equals 1.
Tooltip text appears and states that 53 clusters consist of one record.
d. Move the cursor over the bar where the record count equals 2.
Tooltip text appears and states that two clusters consist of two records.
e. Move the cursor over the bar where the record count equals 3.
Tooltip text appears and states that two clusters consist of three records.
f. Click the bar with a record count of 3 (three).
The Cluster Details pane is suppressed and the two clusters with a record count of three are
highlighted in the cluster listing.
g. Click Cluster 22 in the cluster listing.
h. If necessary, select View  Show Cluster Details Pane.
The top row shows the surviving record based on the rules specified in the Surviving Record
Identification node.
• Maximum value of ID.
• EMAIL is not null.
• JOB TITLE is not null AND Longest Value of JOB TITLE.
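The survivorship logic above can be sketched in plain Python. This is an illustration only, not DataFlux code: the record layout is invented, the field names (EMAIL, JOB TITLE, NOTES) mirror the demo, and the rule behavior is a simplified assumption about what the node does.

```python
# Illustrative sketch of the surviving-record rules (not DataFlux code).
def select_survivor(cluster):
    # Record rule: maximum value of ID chooses the base surviving record.
    survivor = dict(max(cluster, key=lambda r: r["ID"]))

    # Field rule 1: EMAIL is not null -- borrow an e-mail address from
    # another record in the cluster if the survivor's is missing.
    if not survivor.get("EMAIL"):
        for r in cluster:
            if r.get("EMAIL"):
                survivor["EMAIL"] = r["EMAIL"]
                break

    # Field rule 2: JOB TITLE is not null AND longest value.
    # NOTES is an affected field, so it travels with the chosen record.
    candidates = [r for r in cluster if r.get("JOB TITLE")]
    if candidates:
        best = max(candidates, key=lambda r: len(r["JOB TITLE"]))
        survivor["JOB TITLE"] = best["JOB TITLE"]
        survivor["NOTES"] = best.get("NOTES")

    return survivor
```

As in the demo, the record with the maximum ID survives but picks up an e-mail address and the longest job title (plus that record's NOTES) from its cluster mates.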
i. Click the edit tool (on the Cluster tab).
In this Edit mode, the Cluster records area shows the cluster's records, and the Surviving record area shows the surviving record's values after they are overwritten according to the field rule specifications.
1) Double-click the value for the Address field of the Surviving record.
This action makes the field editable.
2) Replace the current value of 333 Christian Street with the value 333 Christian St.
3) Press ENTER.
Notice that the edited value is bold.
4) Click the save tool.
A message window appears.
5) Click Yes.
Cluster 22 now appears as dimmed and is no longer editable.
j. Type 35 in the Go to cluster field.
k. Click the edit tool.
1) Double-click the value for the BUSINESS PHONE field of the Surviving record.
2) Replace the current value of (718)555-0100 with the value 718-555-0100.
3) Press ENTER.
4) Click the save tool.
5) Click Yes in the message window.
Cluster 35 now appears as dimmed and is no longer editable.
You can continue these review-and-correct actions for each of the remaining clusters, including the single-record clusters.
l. Click the Clusters thread to stop editing.
The Cluster tab appears.
A check mark appears next to the two clusters where changes are applied.
m. Select File  Close Entity Resolution File.
n. Review the audit file that was produced.
1) Select Start  All Programs  Accessories  Windows Explorer to access a Windows
Explorer window.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Double-click Ch4D8_Demo_ERF_Audit.txt.
A Notepad window appears with the results of the cluster edits.
4) Select File  Exit to close the Notepad window.
o. Review the target file that was produced.
1) Access the Windows Explorer window.
2) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Double-click Ch4D8_Demo_ERF_Target.txt.
The information in the target file consists of the non-surviving records from the two clusters that
were edited.
4) Select File  Exit to close the Notepad window.
37. Select File  Close.
Exercises
5. Selecting the Best Record
Create a data job that works with the Surviving Record Identification node. The final job flow should
resemble the following:
• Copy the data job Ch4E4_Manufacturers_MatchReport located in the batch_jobs folder of the Basics Exercises repository. Paste it in the batch_jobs folder of the Basics Exercises repository. Rename it Ch4E5_Manufacturers_SelectBestRecord.
• Remove the Match Report node and add a Surviving Record Identification node with the following properties:
Name: Select Best Record
Cluster ID field: Cluster_ID
Options:
 Keep duplicate records.
 Specify SR_ID in the Surviving record ID field.
 Use the primary key as the surviving record ID.
 Specify the ID in the Primary key field.
Record rule: Maximum value of POSTDATE
Field rule: Highest occurrence of CONTACT
• Add a text file output node with the following properties:
Name: Write Results
Output file: Ch4E5_Manufacturer_Best_Record.txt in the directory S:\Workshop\dqdmp1\Exercises\files\output_files
File attributes:
 “ (double quotation mark) as the text qualifier
 Comma as the field delimiter
 Include header row
 Display file after job runs
Output fields:
Cluster_ID
ID
SR_ID
POSTDATE
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
• Save and run the data job.
• Examine the output text file. Do the results make sense?
Cluster Difference
The Cluster Diff node is used to compare two sets of clustered records by reading in data from a left and a
right table (side). From each side, the Cluster Diff node takes two inputs: a numeric record ID field and a
cluster number field.
Possible values for diff types include the following:
COMBINE
A record belongs to a set of records from one or more clusters in the left table that are
combined into a larger cluster in the right table.
DIVIDE
A record belongs to a set of records in a cluster in the left table that is divided into two or
more clusters in the right table.
NETWORK
A record belongs to a set of records that are involved in one or more different multi-record clusters in the left and right tables.
SAME
A record belongs to a set of records that are in the same cluster in both the left and right tables.
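The four diff types can be illustrated with a small Python sketch. This is not the Cluster Diff node's implementation; the classification below is a simplified assumption that looks only at the two clusters containing each record (a strict COMBINE or DIVIDE would also constrain the other clusters involved).

```python
from collections import defaultdict

def diff_type(left, right):
    """Classify each record given two clusterings.

    left, right: dict mapping record ID -> cluster number.
    Returns a dict mapping record ID -> "same", "combine",
    "divide", or "network".
    """
    l_members = defaultdict(set)
    r_members = defaultdict(set)
    for rec, c in left.items():
        l_members[c].add(rec)
    for rec, c in right.items():
        r_members[c].add(rec)

    result = {}
    for rec in left:
        lset = l_members[left[rec]]   # records sharing rec's left cluster
        rset = r_members[right[rec]]  # records sharing rec's right cluster
        if lset == rset:
            result[rec] = "same"
        elif lset < rset:     # whole left cluster absorbed on the right
            result[rec] = "combine"
        elif rset < lset:     # left cluster broken apart on the right
            result[rec] = "divide"
        else:                 # partial overlap in both directions
            result[rec] = "network"
    return result
```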
Creating a Data Job to Compare Clusters (Optional)
In this demonstration, you investigate the Cluster Diff node and the results that it can produce.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Copy an existing job to use as a starting point.
a. Click the expand icon in front of the Basics Demos repository.
b. Click the expand icon in front of the batch_jobs folder.
c. Right-click the Ch4D7_ClusterRecords_Example job and select Copy.
d. Right-click the batch_jobs folder and select Paste.
A message window appears.
e. Click Rename the new item (Copy of …, Copy (2) of …, etc.).
f. Click OK.
5. Rename the copied job.
a. Right-click Copy of Ch4D7_ClusterRecords_Example and select Rename.
b. Type Ch4D9_ClusterDiff_Example.
c. Press ENTER.
6. Double-click Ch4D9_ClusterDiff_Example.
The data job opens on a new tab.
7. Remove the last three nodes.
a. Right-click the Standardization node (labeled Standardize Phone) and select Delete.
b. Right-click the Clustering node (labeled Cluster on Three Conditions) and select Delete.
c. Right-click the Match Report node (labeled Customers Match Report) and select Delete.
8. Update the Match Codes node.
a. Right-click the Match Codes node and select Properties.
b. Double-click the EMAIL field to move it from Available to Selected.
c. Select E-mail match code definition.
d. Verify that the 85 sensitivity is specified.
e. Click OK to close the Match Codes Properties window.
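Match codes collapse near-duplicate values into a comparable form so that fuzzy matches cluster together. The following toy Python function illustrates the idea only; it is NOT the DataFlux match-code algorithm, and the way it maps the sensitivity setting to a prefix length is purely an invented stand-in.

```python
import re

# Toy stand-in for a match code (NOT the DataFlux algorithm).
def toy_match_code(value, sensitivity=85):
    if not value:
        return None  # generate null match codes for blank/null values
    code = re.sub(r"[^A-Z0-9]", "", value.upper())
    # Invented mapping: higher sensitivity keeps more of the value,
    # so fewer records compare as equal.
    keep = max(1, round(len(code) * sensitivity / 100))
    return code[:keep]
```

With this toy function, "Johnson" and "Johnsen" stay distinct at sensitivity 85 but collapse to the same code at 50; real match-code definitions are far more sophisticated (parsing, phonetics, noise-word removal).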
9. Add a Branch node to the job flow.
a. In the Nodes resource pane, click the expand icon in front of the Utilities grouping of nodes.
b. Drag the Branch node to the Data Flow tab. The Branch node appears on the job flow but is not
connected to any node.
c. Establish node connections and a description for the Branch node.
1) Click the new Branch node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Branch as the value for the Name field.
5) Type Branch as the value for the Description field.
6) Click the Node Connections tab.
7) In the Connect from area, click the add button. The Add Nodes window appears.
8) Click the Match Codes node (with a name of Match Codes for Various Fields).
9) Click OK to close the Add Nodes window.
10. Add a Clustering node to the job flow.
a. Click the Branch node in the job flow.
b. In the Nodes resource pane, click the collapse icon to collapse the Utilities grouping of nodes.
c. Click the expand icon in front of the Entity Resolution grouping of nodes.
d. Double-click the Clustering node.
The node appears in the job flow connected from the Branch node.
11. Specify properties for the new Clustering node.
a. Type Clustering Using Email in the Name field.
b. Type Cluster_EMAIL in the Output cluster ID field field.
c. Verify that Sort output by cluster number is not selected.
d. Verify that All clusters is selected in the Cluster area.
e. Specify the one condition.
1) Double-click Name_MatchCode to move from the Available fields list box to the Selected
fields list box.
2) Double-click COMPANY_MatchCode to move from the Available fields list box to the
Selected fields list box.
3) Double-click EMAIL_MatchCode to move from the Available fields list box to the Selected
fields list box.
f. Click OK to close the Clustering Properties window.
g. Establish a description for the Clustering node.
1) Click the new Clustering node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Clustering as the value for the Description field.
12. Add a second Clustering node to the job flow.
a. Click the Branch node in the job flow.
b. If necessary, click the expand icon in front of the Entity Resolution grouping of nodes.
c. Double-click the Clustering node.
The node appears in the job flow connected from the Branch node.
13. Specify properties for the new Clustering node.
a. Type Clustering Using Address Information in the Name field.
b. Type Cluster_Address in the Output cluster ID field field.
c. Verify that Sort output by cluster number is not selected.
d. Verify that All clusters is selected in the Cluster area.
e. Specify two conditions.
1) Double-click Name_MatchCode (to move it from Available to Selected).
2) Double-click COMPANY_MatchCode (to move it from Available to Selected).
3) Double-click ADDRESS_MatchCode (to move it from Available to Selected).
4) Double-click STATE/PROVINCE_MatchCode (to move it from Available to Selected).
5) Click OR.
6) Double-click Name_MatchCode (to move it from Available to Selected).
7) Double-click COMPANY_MatchCode (to move it from Available to Selected).
8) Double-click ADDRESS_MatchCode (to move it from Available to Selected).
9) Double-click ZIP/POSTAL CODE_MatchCode (to move it from Available to Selected).
The conditions should resemble the following:
f. Click OK to close the Clustering Properties window.
g. Establish a description for the Clustering node.
1) Click the new Clustering node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Clustering as the value for the Description field.
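Conceptually, each Clustering node groups records whose match codes agree on every field of at least one condition. A pure-Python sketch using union-find follows; the field names are invented and this is not how the node is implemented internally.

```python
# Illustrative clustering sketch (not the Clustering node's implementation).
def cluster_records(records, conditions):
    parent = list(range(len(records)))   # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for cond in conditions:              # cond = tuple of match-code fields
        seen = {}
        for i, rec in enumerate(records):
            key = tuple(rec[f] for f in cond)
            if None in key:
                continue                 # treat null match codes as no match
            if key in seen:
                union(i, seen[key])      # same composite key -> same cluster
            else:
                seen[key] = i

    return [find(i) for i in range(len(records))]
```

Two records that differ on state but agree on the second (ZIP-based) condition still land in one cluster, which is the effect of the OR between condition sets.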
The job flow should now resemble the following:
14. Add a Cluster Diff node to the job flow.
a. In the Nodes resource pane, verify that the Entity Resolution grouping of nodes is expanded.
b. Right-click the Cluster Diff node and select Insert Node.
c. Position the new Cluster Diff node to the right of and between the two clustering nodes.
d. Establish node connections and a description for the Cluster Diff node.
1) Click the new Cluster Diff node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Cluster Diff as the value for the Name field.
5) Type Cluster Diff as the value for the Description field.
6) Click the Node Connections tab.
7) In the Connect from area, click Slot 0. Then click the add button. The Add Nodes window appears.
8) Click the Clustering Using Email node.
9) Click OK to close the Add Nodes window.
10) In the Connect from area, click Slot 1. Then click the add button. The Add Nodes window appears.
11) Click the Clustering Using Address Information node.
12) Click OK to close the Add Nodes window.
The connections to the Cluster Diff node should now be defined.
15. Specify properties for the new Cluster Diff node.
a. Right-click the Cluster Diff node and select Properties.
b. Type Cluster Diff in the Name field.
c. For the Left table, click the down arrow under Record ID and select ID.
d. For the Left table, click the down arrow under Cluster number and select Cluster_EMAIL.
e. For the Right table, click the down arrow under Record ID and select ID.
f. For the Right table, click the down arrow under Cluster number and select Cluster_Address.
g. Type Diff_Set in the Diff set field in the Output field area.
h. Type Diff_Type in the Diff type field in the Output field area.
i. Verify that Skip rows with “same” diff type is selected.
j. Click Additional Outputs.
1) Click the double right arrow to move all fields to the Output fields list box.
2) Click OK to close the Additional Outputs window.
k. Click OK to close the Cluster Diff Properties window.
16. Add an HTML Report node to the job flow.
a. Click the Cluster Diff node in the job flow.
b. In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
c. Click the expand icon in front of the Data Outputs grouping of nodes.
d. Double-click the HTML Report node.
The node appears in the job flow.
17. Specify properties for the new HTML Report node.
a. Type HTML Report in the Name field.
b. Type Customer Cluster Difference Report in the Report title field.
c. Specify the output file information.
1) Click the browse button next to the Output file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D9_ClusterDiffReport.html in the File name field.
4) Click Save.
d. Click Display report in browser after job run.
e. Click the double right arrow to move all fields to the Selected list box.
f. If necessary, rearrange the selected fields so that the following fields appear first:
Diff_Set
Diff_Type
ID
Cluster_EMAIL
Cluster_Address
g. Click OK to close the HTML Report Properties window.
The final job flow should resemble the following:
18. Save the job.
a. Click the Data Flow tab.
b. Select File  Save.
19. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The HTML file appears.
20. Close the browser when you are finished viewing the HTML report.
21. Review the log.
22. Select File  Close.
4.5 Multi-Input/Multi-Output Data Jobs (Self-Study)
Objectives
 Explore pre-built jobs.
 Understand how to specify and verify the node connections.
 Review several “new” nodes.
New Products Example
A data job that has two inputs and two outputs is explored. This job involves three new nodes:
 Data Joining
 Branch
 Sequencer (Autonumber)
Customer Matches Example
A data job with two inputs and two outputs is explored. The generated match codes are used to provide conditions for data joining.
Multi-Input/Multi-Output Data Job: New Products
In this demonstration, you examine a data job that takes advantage of two inputs and creates two outputs.
The data job uses a Data Joining node, a Branch node, and a Sequencer (Autonumber) node. Each node’s
properties are explored, as well as the data connection information.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Access an existing job.
a. Click the expand icon in front of the Basics Solutions repository.
To establish/connect to the Basics Solutions repository (if necessary), do the following:
5. Click the Administration riser bar.
a. Right-click Repository Definitions and select New Repository Definition.
b. Type Basics Solutions as the value for the Name field.
c. Click Browse next to Database file field and navigate to
S:\Workshop\dqdmp1\solutions\solutions.rps.
d. Click Browse next to Folder field (under File storage) and navigate to
S:\Workshop\dqdmp1\solutions\files.
e. Clear Private.
f. Click OK.
g. Click batch_jobs folder.
h. In the main area, double-click the data job named Ch4D10_AddNewProducts.
The data job diagram should resemble the following:
 The job flow diagram that you access might be “vertical.” In the picture above, the diagram is horizontal for display purposes. Also, the diagram that you access has sticky note objects that are not shown in the above picture.
6. Review the Data Source node.
a. Right-click the Data Source node and select Properties.
b. Verify that the Input table field displays the Products table from the dfConglomerate Gifts data
connection.
c. Verify that all fields from the Products table are selected.
d. Click OK to close the Data Source Properties window.
7. Review the data in the dfConglomerate Products table.
a. Click the Home tab.
b. Click the Data riser bar.
c. Click the expand icon in front of Data Connections.
d. Click the expand icon in front of the dfConglomerate Gifts data connection.
e. Click the Products table.
The main area displays information for the Products table.
f. If necessary, click the Data tab.
g. Verify that there are 29 records in the Products table.
8. Review the Text File Input node.
a. Click the main tab Ch4D10_AddNewProducts.
b. If necessary, click the Data Flow tab.
c. Right-click the Text File Input node and select Properties.
d. Review the input file specifications.
e. Verify that 5 (five) fields are specified.
f. Click OK to close the Text File Input Properties window.
9. Preview the data from the text file input.
a. Verify that the Details pane is shown (if necessary, select View  Show Details Pane).
b. Right-click the text file input node and select Preview.
c. Verify that there are 14 records in the input text file.
10. Review the connections for the Data Joining node.
a. Verify that the Details pane is shown (if necessary, select View  Show Details Pane).
b. Click the Data Joining node.
c. From the Details pane, click the Node Connections tab.
d. Verify that the Connect from pane has two slots - one is filled with the Data Source node and one
is filled with the Text File Input node.
11. Review the Data Joining node.
a. Right-click the Data Joining node and select Properties.
b. Verify that the join type is set to Right.
c. Verify that the join criteria (expression) is the equality of PRODUCT CODE from the Products
data source and PRD_CODE from the text file input.
d. In the Output fields area, verify that the fields from the Products data source (the left table) have
updated output names - each field ends with the text _1.
e. In the Output fields area, verify that the fields from the text file input (the right table) have
updated output names - each field ends with the text _2.
 The Data Joining node enables multiple inputs. The first node connected to this node is always considered the left side, and the second connected node is considered the right side.
f. Click OK to close the Data Joining Properties window.
12. Review the connections for the Branch node.
a. Verify that the Details pane is shown (if necessary, select View  Show Details Pane).
b. Click the Branch node.
c. From the Details pane, click the Node Connections tab.
d. Verify that the Connect to pane has two slots - one is filled with a Data Validation node labeled
No Match and one is filled with a Data Validation node labeled Match.
13. Review the Branch node.
a. Right-click the Branch node and select Properties.

The Branch node enables multiple outputs. It is typically followed by two or more
Data Validation nodes to filter records down a particular path. A Branch node can
have up to 32 connections.

The Memory cache size field specifies the amount of memory (in Megabytes)
allocated to this step.

The Land all data locally before processing continues check box, when selected,
ensures that all data from the data sources is placed on the local machine before
the job continues to be processed. (Selecting this option can reduce the job
performance.)
b. Click OK to close the Branch Properties window.
14. Review the Data Validation node labeled Match.
a. Right-click the Match Data Validation node and select Properties.
b. Verify that the expression is ID_1 is not null.

Because this is a right join, if a match occurs (that is, if PRODUCT CODE equals
PRD_CODE), then the ID_1 field has a value.
c. Click OK to close the Data Validation Properties window.
15. Review the Data Validation node labeled No Match.
a. Right-click the No Match Data Validation node and select Properties.
b. Verify that the expression is ID_1 is null.

Because this is a right join, if a match does not occur (that is, if PRODUCT CODE does
not equal PRD_CODE), then the ID_1 field does not have a value.
c. Click OK to close the Data Validation Properties window.
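The right-join-then-branch pattern reviewed in the steps above can be sketched in plain Python. The data values here are invented and the _1/_2 renaming is imitated by hand; this illustrates the logic only and is not DataFlux code.

```python
# Invented sample data; field names echo the demo.
products = [{"ID": 1, "PRODUCT CODE": "P1"},
            {"ID": 2, "PRODUCT CODE": "P2"}]
text_file = [{"PRD_CODE": "P2", "DESC": "Mug"},
             {"PRD_CODE": "P9", "DESC": "Vase"}]

by_code = {p["PRODUCT CODE"]: p for p in products}

joined = []
for row in text_file:                     # right join: every text-file
    match = by_code.get(row["PRD_CODE"])  # record is kept
    rec = {k + "_2": v for k, v in row.items()}           # right fields -> _2
    rec.update({k + "_1": (match[k] if match else None)   # left fields -> _1
                for k in ("ID", "PRODUCT CODE")})
    joined.append(rec)

# Branch + Data Validation: ID_1 is not null -> match; ID_1 is null -> no match.
matches = [r for r in joined if r["ID_1"] is not None]
no_matches = [r for r in joined if r["ID_1"] is None]
```

The unmatched text-file record (PRD_CODE P9) flows down the No Match path with a null ID_1, exactly as the Data Validation expressions test.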
16. Review the Text File Output node labeled Products Matched.
a. Right-click the Text File Output node and select Properties.
b. Review the output file specifications.
c. Verify that all input (Available) fields are selected for output.
d. Click OK to close the Text File Output Properties window.
17. Review the Sequencer (Autonumber) node labeled Create Unique ID.
a. Right-click the Sequencer (Autonumber) node and select Properties.
b. Verify that the field to be created (Field name field) is named PK.
c. Verify that the Start number field is set to 91.
d. Verify that the Interval field is set to 1.
e. Click OK to close the Sequencer (Autonumber) Properties window.
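The Sequencer (Autonumber) node's behavior amounts to attaching a running number to each row. A minimal Python sketch follows; the function name and row layout are invented, while start=91 and interval=1 mirror the values verified above.

```python
from itertools import count

# Minimal stand-in for the Sequencer (Autonumber) node (not DataFlux code).
def autonumber(rows, field="PK", start=91, interval=1):
    seq = count(start, interval)          # 91, 92, 93, ...
    return [{**row, field: next(seq)} for row in rows]
```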
18. Review the Data Target (Insert) node.
a. Right-click the Data Target (Insert) node and select Properties.
b. Verify the fields from the text file (field names end with _2) are in the Selected list and are
renamed to match their corresponding fields in the existing Products table.
c. Click OK to close the Data Target (Insert) Properties window.
19. If necessary, save the job.
a. Click the Data Flow tab.
b. Select File  Save.
20. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The CSV file opens in Microsoft Excel.
c. Select File  Exit to close Microsoft Excel. (If you are prompted, do not save the changes.)
21. Click the Log tab to review the log.
Verify that 8 rows were added to the Products table.
22. Click the Data Flow tab.
23. Select File  Close.
24. Review the new table.
a. Click the Data Connections riser bar.
b. Click the expand icon in front of the dfConglomerate Gifts data connection.
c. Click the Products table.
d. If necessary, click the Data tab.
e. Verify that there are 37 rows in the Products table. (Therefore, eight records were added.)
Multi-Input/Multi-Output Data Job: Customer Matches
In this demonstration, you examine a data job that takes advantage of two inputs and creates two outputs.
The data job generates match codes on a variety of fields from each of the sources and then performs a data join using various conditions that involve the match codes. The matches are written to a text file, as are the non-matches.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Copy an existing job to use as a starting point.
a. Click the expand icon in front of the Basics Solutions repository.
b. Click the batch_jobs folder.
c. In the main area, double-click the data job named Ch4D11_CustomerMatches.
The data job diagram should resemble the following:
 The job flow diagram that you access might be “vertical.” In the picture above, the diagram is horizontal for display purposes. Also, the diagram that you access has sticky note objects that are not shown in the above picture.
5. Review the Data Source node.
a. Right-click the Data Source node and select Properties.
b. Verify that the Input table field displays the Customers table from the dfConglomerate Gifts
data connection.
c. Verify that all fields from the Customers table are selected.
d. Click OK to close the Data Source Properties window.
6. Review the data in the dfConglomerate Customers table.
a. Click the Home tab.
b. Click the Data Connections riser bar.
c. Click the expand icon in front of the dfConglomerate Gifts data connection.
d. Click the Customers table.
The main area displays information for the Customers table.
e. If necessary, click the Data tab.
f. Notice that there are 63 records in the Customers table.
7. Review the Text File Input node.
a. Click the Data Flow tab.
b. Right-click the Text File Input node and select Properties.
c. Review the input file specifications.
d. Verify that 8 (eight) fields are specified.
e. Click OK to close the Text File Input Properties window.
8. Review the Match Codes (Parsed) node.
a. Right-click the Match Codes (Parsed) node and select Properties.
b. Verify that Output field has a value of Name_Matchcode.
c. Verify that the Name definition is applied at the 85 sensitivity.
d. Verify that two tokens are used to form the output match code field.
e. Click OK to close the Match Codes (Parsed) Properties window.
9. Review the Match Codes node that follows the Match Codes (Parsed) node.
a. Right-click the Match Codes node and select Properties.
b. Verify that 6 (six) fields have match codes generated using an appropriate definition.
c. Verify that all fields generate match codes at the 85 sensitivity.
d. Click OK to close the Match Codes Properties window.
10. Review the Match Codes node that follows the Text File Input node.
a. Right-click the Match Codes node and select Properties.
b. Verify that 7 (seven) fields have match codes generated using an appropriate definition.
c. Verify that all fields generate match codes at the 85 sensitivity.
d. Click OK to close the Match Codes Properties window.
11. Review the Sequencer (Autonumber) node.
a. Right-click the Sequencer (Autonumber) node and select Properties.
b. Verify that the field to be created (Field name field) is named ID.
c. Verify that the Start number field is set to 1.
d. Verify that the Interval field is set to 1.
e. Click OK to close the Sequencer (Autonumber) Properties window.
12. Review the Data Joining node.
a. Right-click the Data Joining node and select Properties.
b. Verify that the Join type is set to Right.
c. Verify that the join criteria involves one of four different conditions.
d. In the Output fields area, verify that the fields from the Customers data source (the left table)
have updated Output Names. That is, each field ends with the text _1.
e. In the Output fields area, verify that the fields from the text file input (the right table) have
updated Output Names. That is, each field ends with the text _2.
f. Click OK to close the Data Joining Properties window.
13. Review the Data Validation node labeled Text File - Record Matches DB Record.
a. Right-click the Data Validation node and select Properties.
b. Verify that the expression is ID_1 is not null.

Because this is a right join, if a match occurs (if at least one of the four conditions specified
in the data joining is met), then the ID_1 field has a value.
c. Click OK to close the Data Validation Properties window.
14. Review the Data Validation node labeled Text File - Record Does Not Match DB Record.
a. Right-click this Data Validation node and select Properties.
b. Verify that the expression is ID_1 is null.
 Because this is a right join, if a match does not occur (that is, if none of the four conditions specified in the data joining is met), then the ID_1 field does not have a value.
c. Click OK to close the Data Validation Properties window.
15. Review the Text File Output node labeled Matches.
a. Right-click the Text File Output node and select Properties.
b. Review the output file specifications.
c. Verify that all input (Available) fields are selected for output.
d. Click OK to close the Text File Output Properties window.
16. Review the Text File Output node labeled Non-Matches.
a. Right-click the Text File Output node and select Properties.
b. Review the output file specifications.
c. Verify that only the input (Available) fields with names ending in _2 are selected for output.
d. Click OK to close the Text File Output Properties window.
17. If necessary, save the job.
a. Click the Data Flow tab.
b. Select File  Save.
18. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions  Run Data Job.
The two CSV files open in Microsoft Excel.
c. Select File  Exit to close each of the Microsoft Excel sessions. (If you are prompted, do not
save the changes.)
19. Click the Log tab to review the log.
20. Select File  Close.
Exercises
6. Creating a Multi-Input/Multi-Output Data Job
In this exercise, you create a final job flow that should resemble the following:
• Create a new data job named Ch4E6_MultiInputOutput in the batch_jobs folder of the Basics Exercises repository.
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data source. Select the following fields:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
• Standardize the following fields with the corresponding definitions:

  Field Name            Definition
  MANUFACTURER          Organization
  CONTACT               Name
  CONTACT_ADDRESS       Address
  CONTACT_STATE_PROV    State/Province(Abbreviation)
  CONTACT_PHONE         Phone
• Use the specified definitions and sensitivities. Accept the default names for the match code fields. Be sure to generate null match codes for blank field values and to preserve null values. Generate match codes for the following fields:

  Field Name                 Definition        Sensitivity
  MANUFACTURER_Stnd          Organization      75
  CONTACT_Stnd               Name              75
  CONTACT_ADDRESS_Stnd       Address           85
  CONTACT_STATE_PROV_Stnd    State/Province    85
  CONTACT_POSTAL_CD          Postal Code       85
• Also add a text file as input. The text file uses " (double quotation mark) to qualify data, is comma-delimited, and has a header row.
Text file: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.txt
DFL file: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.dfl
• Add a Standardization node following the Text File Input node and standardize as follows:
  Field Name      Definition
  COMPANY         Organization
  NAME            Name
  WORK_ADDRESS    Address
  WORK_STATE      State/Province(Abbreviation)
  WORK_PHONE      Phone
• Use the specified definitions and sensitivities. Accept the default names for the match code fields. Be sure to generate null match codes for blank field values and to preserve null values. Generate match codes for the following fields:

  Field Name           Definition        Sensitivity
  COMPANY_Stnd         Organization      75
  NAME_Stnd            Name              75
  WORK_ADDRESS_Stnd    Address           85
  WORK_STATE_Stnd      State/Province    85
  WORK_ZIP             Postal Code       85
• Create a unique identifier field in the data flowing from the text file. Name the field ID, start the values at 1 (one), and increment by 1 (one).
• Add a Data Joining node to the job flow to join the data flowing from the data source with the data flowing from the text file. The join should be a right join and the join criteria can be one of the following:
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_ADDRESS_Stnd_MatchCode = WORK_ADDRESS_Stnd_MatchCode
CONTACT_STATE_PROV_Stnd_MatchCode = WORK_STATE_Stnd_MatchCode
or
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_ADDRESS_Stnd_MatchCode = WORK_ADDRESS_Stnd_MatchCode
CONTACT_POSTAL_CD_MatchCode = WORK_ZIP_MatchCode
or
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_PHONE_Stnd = WORK_PHONE_Stnd
CONTACT_POSTAL_CD_MatchCode = WORK_ZIP_MatchCode
• Add a Branch node following the Data Joining node.
• Add two Data Validation nodes. One searches for records where ID_1 is not null (matches) and the other searches for records where ID_1 is null (non-matches).
• Add two Text File Output nodes: one to hold the matches and one for the non-matches. Create each file to use " (double quotation mark) to qualify data, comma-delimited, with a header row.
Add the following fields to the Matches text file:

  Field Name        Output Name
  ID_1              DB_ID
  MANUFACTURER_1    MANUFACTURER
  CONTACT_1         CONTACT
  ID_2              TEXT_ID
  COMPANY_2         COMPANY
  NAME_2            NAME
Add the following fields to the Non-matches text file:

  Field Name        Output Name
  COMPANY_2         COMPANY
  NAME_2            NAME
  WORK_ADDRESS_2    WORK_ADDRESS
  WORK_CITY_2       WORK_CITY
  WORK_STATE_2      WORK_STATE
  WORK_ZIP_2        WORK_ZIP
  WORK_PHONE_2      WORK_PHONE

• Save and run the job.
QUESTION: How many records were written to the Matches text file?
Answer:
QUESTION: How many records were written to the Non-Matches text file?
Answer:
4.6 Solutions to Exercises
1. Create a Simple Data Job
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Click the Basics Exercises repository.
e. Click New  Data Job.
1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in
field.)
2) Type Ch4E1_Breakfast_Items_OZ in the Name field.
3) Click OK.
The new data job appears on a tab.
f. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the Resource pane.
2) Click the expand icon in front of the Data Inputs grouping of nodes.
3) Double-click the Data Source node.
4) Type Breakfast Items Table in the Name field.
5) Click the browse button next to the Input table field.
a) Click the expand icon in front of the dfConglomerate Grocery data source.
b) Click BREAKFAST_ITEMS.
c) Click OK to close the Select Table window.
6) Click OK to save the changes and close the Data Source Properties window.
The Data Source node appears in the job diagram with updated information displayed.
g. Add a Data Validation node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
2) Click the expand icon in front of the Utilities grouping of nodes.
3) Double-click the Data Validation node.
h. Specify properties for the Data Validation node.
1) Type Filter for OZ UOM in the Name field.
2) Click the down arrow next to Field and select UOM.
3) Click the down arrow in the Operation field and select equal to.
4) Click Single (if necessary) and then type OZ.
5) Click Add Condition. The Expression area is updated.
The final settings for the Data Validation node should resemble the following:
6) Click OK to close the Data Validation Properties window.
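The Data Validation node acts here as a row filter. A minimal sketch of the same expression, applied to hypothetical sample rows (invented data, not the real BREAKFAST_ITEMS table):

```python
# Hypothetical sample rows standing in for the BREAKFAST_ITEMS table.
rows = [
    {"ID": 1, "BRAND": "Oatey",  "UOM": "OZ"},
    {"ID": 2, "BRAND": "Juicy",  "UOM": "GAL"},
    {"ID": 3, "BRAND": "Flaker", "UOM": "OZ"},
]

# The expression "UOM equal to OZ" keeps only the rows that satisfy it;
# everything else is dropped from the flow.
passed = [r for r in rows if r["UOM"] == "OZ"]
```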
i. Preview the Data Validation node.
1) If necessary, click the Data Validation node in the job diagram.
2) Verify that the Details pane is displayed.
3) Click the Preview tool.
4) Scroll to the right to view the UOM field.
j. Add a Text File Output node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Utilities grouping of nodes.
2) Click the expand icon in front of the Data Outputs grouping of nodes.
3) Double-click the Text File Output node.
k. Specify properties for the Text File Output node.
1) Type Breakfast Items - OZ in the Name field.
2) Specify the output file information.
a) Click the browse button next to the Output file field.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E1_Breakfast_Items_OZ.txt in the File name field.
d) Click Save.
3) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
4) Re-order selected fields for the output file.
a) Click MANUFACTURER_ID in the Selected area.
b) Click the up arrow until MANUFACTURER_ID appears before BRAND (and after ID).
5) Click OK to save the changes and close the Text File Output Properties window.
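The file attributes chosen above (double-quote text qualifier, comma delimiter, header row) correspond to standard CSV quoting. A sketch with Python's csv module and a made-up row:

```python
import csv
import io

# One made-up row; field order mirrors the re-ordered output fields above.
rows = [{"ID": 1, "MANUFACTURER_ID": 10, "BRAND": "Acme"}]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["ID", "MANUFACTURER_ID", "BRAND"],
    quoting=csv.QUOTE_ALL,   # " (double quotation mark) as text qualifier
)
writer.writeheader()         # Include header row
writer.writerows(rows)       # Comma is the default field delimiter
```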
l. Add a note to the data flow.
1) Click the Insert Note tool.
2) Click and drag the note to position it under the nodes.
3) Type the following as text for the note:
This job:
1. Reads the records from the BreakfastItems table
2. Filters for the records where UOM=OZ
3. Writes the results to a text file
The data job should now resemble the following:
m. Specify properties for the job.
1) Click the Settings tab.
2) Type Create and run a data job in the Description field.
3) Type The job writes Breakfast Items with the Unit of Measure of OZ to a text file in the
Notes field.
n. Select File  Save.
o. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions  Run Data Job.
The job runs and the requested text file output is displayed in a Notepad window.
3) Select File  Exit to close the Notepad window.
4) Notice the processing information on each node.
p. View the log information.
1) Click the Log tab.
2) Review the information for each of the nodes.
q. Select File  Close.
QUESTION: How many records were selected in the filter and therefore written to the text
file?
Answer:
1615 rows
QUESTION: How many records were read from the source table?
Answer:
2617 rows
2. Creating a Data Job that Involves Standardization and Parsing
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Create a new data job.
1) Verify that the Home tab is selected.
2) Click the Folders riser bar.
3) Click the Basics Exercises repository.
4) Click New  Data Job.
5) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in
field.)
6) Type Ch4E2_Manufacturers_DataQuality in the Name field.
7) Click OK.
c. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the Resource pane.
2) Click the expand icon in front of the Data Inputs grouping of nodes.
3) Double-click the Data Source node.
d. Specify the properties of the Data Source node.
1) Type Manufacturers Table in the Name field.
2) Click the browse button next to the Input table field.
3) Click the expand icon in front of the dfConglomerate Grocery data source.
4) Click MANUFACTURERS.
5) Click OK to close the Select Table window.
6) In the Output fields area, select only the following fields:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
7) Click OK to save the changes and close the Data Source Properties window.
e. Add a Standardization node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
2) Click the expand icon in front of the Quality grouping of nodes.
3) Double-click the Standardization node.
The node appears in the data flow. The Standardization Properties window appears.
f. Specify properties for the Standardization node.
1) Type Standardize Fields in the Name field.
2) Double-click each of the following fields to move them from the Available list box to the
Selected list box.
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_STATE_PROV
CONTACT_CNTRY
3) For the MANUFACTURER field, click the down arrow under Scheme and select Ch3D8 Company Phrase Scheme.
4) For the CONTACT field, click the down arrow under Definition and select Name.
5) For the CONTACT_ADDRESS field, click the down arrow under Definition and select Address.
6) For the CONTACT_STATE_PROV field, click the down arrow under Definition and select State/Province(Abbreviation).
7) For the CONTACT_CNTRY field, click the down arrow under Scheme and select Ch4E6 CONTACT_CNTRY Scheme.
8) Click Preserve null values.
9) Click OK to close the Standardization Properties window.
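Conceptually, a standardization scheme is a lookup from raw variants to standard values. A toy sketch (the scheme entries here are invented; the real Ch3D8 scheme lives in the Quality Knowledge Base):

```python
# Invented scheme entries -- a real scheme is built from data analysis.
scheme = {"intl": "International", "corp": "Corporation", "mfg": "Manufacturing"}

def standardize(value, scheme):
    """Apply the scheme word by word; preserve null values, as selected above."""
    if value is None:
        return None
    out = [scheme.get(word.lower().rstrip("."), word) for word in value.split()]
    return " ".join(out)
```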
g. Preview the Standardization node.
1) If necessary, click the Standardization node in the job diagram.
2) Verify that the Details pane is displayed.
3) Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
4) Scroll to the right to view the _Stnd fields.
h. Add a Parsing node to the job flow.
1) In the Nodes resource pane, verify that the Quality grouping of nodes is expanded.
2) Double-click the Parsing node.
The node appears in the data flow. The Parsing Properties window appears.
i. Specify properties for the Parsing node.
1) Type Parse Contact_Stnd in the Name field.
2) Click the down arrow under Field to parse and select CONTACT_Stnd.
3) Click the down arrow under Definition and select Name.
4) Double-click each of the following tokens to move them from the Available list box to the
Selected list box.
Given Name
Middle Name
Family Name
5) Change the output name for the Given Name token.
a) Double-click the default output name for the Given Name token.
b) Type FIRST_NAME.
c) Press ENTER.
6) Change the output name for the Middle Name token.
a) Double-click the default output name for the Middle Name token.
b) Type MIDDLE_NAME.
c) Press ENTER.
7) Change the output name for the Family Name token.
a) Double-click the default output name for the Family Name token.
b) Type LAST_NAME.
c) Press ENTER.
8) Click Preserve null values.
The settings in the Parsing Properties window should resemble the following:
9) Click OK to close the Parsing Properties window.
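Parsing splits one field into tokens. A much-simplified sketch of the Given/Middle/Family split (the real Name parse definition handles prefixes, suffixes, and many more patterns):

```python
def parse_name(name):
    """Naive token split; None stands in for a missing token (nulls preserved)."""
    if name is None:
        return {"FIRST_NAME": None, "MIDDLE_NAME": None, "LAST_NAME": None}
    parts = name.split()
    first = parts[0] if parts else None
    last = parts[-1] if len(parts) > 1 else None
    middle = " ".join(parts[1:-1]) or None
    return {"FIRST_NAME": first, "MIDDLE_NAME": middle, "LAST_NAME": last}
```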
j. Preview the Parsing node.
1) If necessary, click the Parsing node in the job diagram.
2) Verify that the Details pane is displayed.
3) Click the Preview tool.
A sample of the records appears on the Preview tab of the Details panel.
4) Scroll to the right to view the new, parsed name fields.
k. Add a Data Target (Insert) node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Quality grouping of nodes.
2) Click the expand icon in front of the Data Outputs grouping of nodes.
3) Double-click the Data Target (Insert) node.
The node appears in the data flow. The Data Target (Insert) Properties window appears.
l. Specify properties for the Data Target (Insert) node.
1) Type Insert Records on Manufacturers_Stnd Table in the Name field.
2) Click the browse button next to the Output table field. The Select Table window appears.
a) Click the expand icon in front of the dfConglomerate Grocery data source.
b) Click dfConglomerate Grocery.
c) Click the New Table tool to be able to add a table to the selected data connection.
(1) Type MANUFACTURERS_STND in the Enter a name for the new table field.
(2) Click OK to close the New Table window.
d) Click OK to close the Select Table window.
3) Click Delete existing rows.
4) Reorder and rename the selected output fields.
a) Remove all fields from the Selected list box:
 Each field can be removed by clicking the field in the Selected list box and then clicking the delete button. Extended selection is also an option.
b) Move the following fields to the Selected side, in the following order:
ID
MANUFACTURER_Stnd
FIRST_NAME
MIDDLE_NAME
LAST_NAME
CONTACT_ADDRESS_Stnd
CONTACT_CITY
CONTACT_STATE_PROV_Stnd
CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd
CONTACT_PHONE
POSTDATE
5) Rename the _Stnd fields.
Field Name                 Output Name
ID                         ID
MANUFACTURER_Stnd          MANUFACTURER
FIRST_NAME                 FIRST_NAME
MIDDLE_NAME                MIDDLE_NAME
LAST_NAME                  LAST_NAME
CONTACT_ADDRESS_Stnd       CONTACT_ADDRESS
CONTACT_CITY               CONTACT_CITY
CONTACT_STATE_PROV_Stnd    CONTACT_STATE_PROV
CONTACT_POSTAL_CD          CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd         CONTACT_CNTRY
CONTACT_PHONE              CONTACT_PHONE
POSTDATE                   POSTDATE
6) Click OK to save the changes and close the Data Target (Insert) Properties window.
m. Save the job.
1) Click the Data Flow tab.
2) Select File  Save.
n. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions  Run Data Job.
3) Notice the processing information on each node.
o. View the log information.
1) Click the Log tab.
2) Review the information for each of the nodes.
p. Select File  Close.
q. View the newly created data.
1) If necessary, click the Home tab.
3) Click the expand icon in front of Data Connections.
4) Click the expand icon in front of the dfConglomerate Grocery data connection.
5) Click the MANUFACTURERS_STND table object.
6) In the information area, click the Data tab.
7) Scroll through the data.
3. Performing Address Verification on the MANUFACTURERS Table
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Click the Basics Exercises repository.
e. Click New  Data Job.
1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
2) Type Ch4E3_MANUFACTURERS_Verify in the Name field.
3) Click OK.
f. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the Resource pane.
2) Click the expand icon in front of the Data Inputs grouping of nodes.
3) Double-click the Data Source node.
4) Type Manufacturers Table in the Name field.
5) Click the browse button next to the Input table field.
a) Click the expand icon in front of the dfConglomerate Grocery data source.
b) Click MANUFACTURERS.
c) Click OK to close the Select Table window.
6) Click OK to save changes and close the Data Source Properties window.
g. Add an Address Verification (US/Canada) node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
2) Click the expand icon in front of the Enrichment grouping of nodes.
3) Double-click the Address Verification (US/Canada) node.
h. Specify properties for the Address Verification (US/Canada) node.
1) Type Verify US Addresses in the Name field.
2) Verify that United States is selected in the Address area.
3) Specify the Input information.
a) Click the down arrow under Field Type for the MANUFACTURER field and click Firm.
b) Click the down arrow under Field Type for the STREET_ADDR field and click Address Line 1.
c) Click the down arrow under Field Type for the CITY field and click City.
d) Click the down arrow under Field Type for the STATE_PROV field and click State.
e) Click the down arrow under Field Type for the POSTAL_CD field and click Zip.
4) Specify options for the address verification.
a) Click Options.
b) Click Proper case results.
c) Click Output blanks as nulls.
d) Click Street abbreviation.
e) Click City abbreviation.
f) Clear CASS compliance.
g) Verify that Insert dash between ZIP and ZIP4 is selected.
h) Click OK to save the settings and close the Options window.
5) Specify the Output fields’ information.
a) Double-click Firm to move this field from the Available list box to the Selected list box.
b) Double-click Address Line 1 to move this field from the Available list box to the Selected
list box.
c) Double-click City to move this field from the Available list box to the Selected list box.
d) Double-click State to move this field from the Available list box to the Selected list box.
e) Double-click ZIP/Postal_Code to move this field from the Available list box to the
Selected list box.
f) Double-click US_County_Name to move this field from the Available list box to the
Selected list box.
g) Double-click US_Result_Code to move this field from the Available list box to the
Selected list box.
h) For the Output Name field, type _Verified after Firm.
i) For the Output Name field of Address_Line_1, type Address_Verified.
j) For the Output Name field, type _Verified after City.
k) For the Output Name field, type _Verified after State.
l) For the Output Name field of ZIP/Postal Code, type Zip_Verified.
6) Limit the original fields in output.
a) Click Additional Outputs.
b) Click the CONTACT field in the Output fields area, and then click the delete button.
c) Click the CONTACT_ADDRESS field in the Output fields area, and then click the delete button.
d) Click the CONTACT_CITY field in the Output fields area, and then click the delete button.
e) Click the CONTACT_STATE_PROV field in the Output fields area, and then click the delete button.
f) Click the CONTACT_POSTAL_CD field in the Output fields area, and then click the delete button.
g) Click the CONTACT_CNTRY field in the Output fields area, and then click the delete button.
h) Click the CONTACT_PHONE field in the Output fields area, and then click the delete button.
i) Click the NOTES field in the Output fields area, and then click the delete button.
j) Click the POSTDATE field in the Output fields area, and then click the delete button.
k) Click OK to close the Additional Outputs window.
7) Click OK to close the Address Verification (US/Canada) Properties window.
i. Preview the Address Verification node.
1) Right-click the Address Verification node and select Preview.
2) Scroll to view the results.
j. Add a Text File Output node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Enrichment grouping of nodes.
2) Click the expand icon in front of the Data Outputs grouping of nodes.
3) Double-click the Text File Output node.
k. Specify properties for the Text File Output node.
1) Type Manufacturers Info – Verify in the Name field.
2) Specify the output file information.
a) Click the browse button next to the Output file field.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E3_Manufacturer_Verify.txt in the File name field.
d) Click Save.
3) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
4) Specify desired fields for the output file.
a) Click the double left arrow to move all default selected fields from the list box.
b) Specify the following fields with the specified output name:

Field Name          Output Name
ID                  ID
Firm_Verified       Manufacturer
Address_Verified    Address
City_Verified       City
State_Verified      State
ZIP_Verified        ZIP
US_County_Name      US_County_Name
US_Result_Code      US_Result_Code
5) Export the field layout.
a) Click Export.
b) If necessary, navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E3_Manufacturer_Verify.dfl in the File name field.
d) Click Save.
6) Click OK to save the changes and close the Text File Output Properties window.
l. Save the job.
1) Click the Data Flow tab.
2) Select File  Save.
m. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions  Run Data Job.
n. Verify that the text file appears.
o. Select File  Exit to close the Notepad window.
Notice the processing information on each node.
p. View the log information.
1) Click the Log tab.
2) Review the information for each of the nodes.
q. Select File  Close.
4. Creating a Data Job that Generates a Match Report
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Create a new data job.
1) Verify that the Home tab is selected.
2) Click the Folders riser bar.
3) Click the Basics Exercises repository.
4) Click New  Data Job.
5) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
6) Type Ch4E4_Manufacturers_MatchReport in the Name field.
7) Click OK.
c. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the Resource pane.
2) Click the expand icon in front of the Data Inputs grouping of nodes.
3) Double-click the Data Source node.
d. Specify the properties of the Data Source node.
1) Type Manufacturers Table in the Name field.
2) Click the browse button next to the Input table field.
3) Click the expand icon in front of the dfConglomerate Grocery data source.
4) Click MANUFACTURERS.
5) Click OK to close the Select Table window.
6) In the Output fields area, select only the following fields:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
7) Click OK to save the changes and close the Data Source Properties window.
e. Add a Match Codes node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
2) Click the expand icon in front of the Entity Resolution grouping of nodes.
3) Double-click the Match Codes node.
The node appears in the data flow. The Match Codes Properties window appears.
f. Specify properties for the Match Codes node.
1) Type Match Codes for Various Fields in the Name field.
2) Double-click the MANUFACTURER field to move it from the Available list box to the
Selected list box.
3) Click the down arrow under Definition and select Organization.
4) Accept the default Sensitivity of 75.
5) Double-click the CONTACT field to move it from the Available list box to the Selected list
box.
6) Click the down arrow under Definition and select Name.
7) Accept the default Sensitivity of 75.
8) Double-click the CONTACT_ADDRESS field to move it from the Available list box to the
Selected list box.
9) Click the down arrow under Definition and select Address.
10) Accept the default Sensitivity of 85.
11) Double-click the CONTACT_STATE_PROV field to move it from the Available list box to
the Selected list box.
12) Click the down arrow under Definition and select State/Province.
13) Accept the default Sensitivity of 85.
14) Double-click the CONTACT_POSTAL_CD field to move it from the Available list box to
the Selected list box.
15) Click the down arrow under Definition and select Postal Code.
16) Accept the default Sensitivity of 85.
17) Double-click the CONTACT_PHONE field to move it from the Available list box to the
Selected list box.
18) Click the down arrow under Definition and select Phone.
19) Accept the default Sensitivity of 95.
20) Click Generate null match codes for blank field values.
21) Click Preserve null values.
22) Click OK to close the Match Codes Properties window.
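A match code is a fuzzy key: values that are "close enough" collapse to the same code, and lower sensitivities collapse more aggressively. The sketch below is a crude stand-in for the Organization definition, with an invented noise-word list; the real match definitions in the Quality Knowledge Base are far more sophisticated:

```python
import re

# Invented noise-word list -- real definitions use QKB vocabularies.
NOISE = r"\b(INC|INCORPORATED|CORP|CORPORATION|CO|LLC)\b"

def match_code(value):
    """Blank or null input yields a null match code, as selected above."""
    if not value:
        return None
    v = re.sub(r"[^A-Z0-9 ]", "", value.upper())  # drop case and punctuation
    v = re.sub(NOISE, "", v)                      # drop organization noise words
    return re.sub(r"\s+", "", v)                  # drop remaining whitespace
```

Two spellings of the same organization then share one code, which is what the Clustering node later groups on.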
g. Preview the Match Codes node.
1) If necessary, click the Match Codes node in the job diagram.
2) Verify that the Details pane is displayed.
3) Click the Preview tool.
A sample of the records appears on the Preview tab of the Details panel.
4) Scroll to the right to view the _MatchCode fields.
h. Add a Clustering node to the job flow.
1) Verify that the Entity Resolution grouping of nodes is expanded.
2) Double-click the Clustering node.
The node appears in the data flow. The Clustering Properties window appears.
i. Specify properties for the Clustering node.
1) Type Cluster on Two Conditions in the Name field.
2) Type Cluster_ID in the Output cluster ID field field.
3) Click Sort output by cluster number.
4) Verify that All clusters is selected in the Cluster area.
5) Specify the first condition.
a) Double-click MANUFACTURER_MatchCode to move from the Available fields list
box to the Selected fields list box.
b) Double-click CONTACT_MatchCode to move from the Available fields list box to the
Selected fields list box.
c) Double-click CONTACT_ADDRESS_MatchCode to move from the Available fields list
box to the Selected fields list box.
d) Double-click CONTACT_POSTAL_CD_MatchCode to move from the Available fields
list box to the Selected fields list box.
6) Specify the second condition.
a) Click OR.
b) Double-click MANUFACTURER_MatchCode to move from the Available fields list
box to the Selected fields list box.
c) Double-click CONTACT_MatchCode to move from the Available fields list box to the
Selected fields list box.
d) Double-click CONTACT_STATE_PROV_MatchCode to move from the Available
fields list box to the Selected fields list box.
e) Double-click CONTACT_PHONE_MatchCode to move from the Available fields list box
to the Selected fields list box.
7) Click OK to close the Clustering Properties window.
j. Preview the Clustering node.
1) If necessary, click the Clustering node in the job diagram.
2) Verify that the Details pane is displayed.
3) Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
4) Scroll to the right to view the Cluster_ID field.
5) Scroll down through the observations.
 Some records have duplicate values for Cluster_ID.
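Records that share a Cluster_ID were linked by one of the two OR'd conditions. The union-find sketch below illustrates the idea on four invented records and two simplified rules; note that cluster membership is transitive, so a record can join a cluster through either rule:

```python
# Invented records; field names abbreviate the match-code fields above.
recs = {
    1: {"org": "X", "name": "N", "addr": "A", "zip": "Z"},
    2: {"org": "X", "name": "N", "addr": "A", "zip": "Z"},  # rule 1 match with 1
    3: {"org": "X", "name": "N", "addr": "B", "zip": "Z"},  # rule 2 match with 1
    4: {"org": "Q", "name": "M", "addr": "C", "zip": "V"},  # matches nothing
}
rules = [("org", "name", "addr", "zip"), ("org", "name", "zip")]  # OR'd conditions

parent = {i: i for i in recs}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

ids = sorted(recs)
for i in ids:
    for j in ids:
        if i < j and any(all(recs[i][f] == recs[j][f] for f in rule) for rule in rules):
            parent[find(i)] = find(j)  # merge the two clusters

cluster_of = {i: find(i) for i in ids}
```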
k. Add a Match Report node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
2) Click the expand icon in front of the Data Outputs grouping of nodes.
3) Double-click the Match Report node.
The node appears in the data flow. The Match Report Properties window appears.
l. Specify properties for the Match Report node.
1) Type Manufacturers Match Report in the Name field.
2) Click the browse button next to the Report file field.
a) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
b) Type Ch4E4_Manufacturers_MatchReport in the File name field.
c) Click Save.
3) Type Manufacturers Match Report – Two Conditions in the Report Title field.
4) Click Launch Viewer after job is completed.
5) Click the down arrow next to Cluster field and select Cluster_ID.
6) Specify the report fields.
a) Click the double left arrow to move all fields from the Selected list box.
b) Double-click Cluster_ID to move this field from the Available list box to the Selected list
box.
c) Double-click ID to move this field from the Available list box to the Selected list box.
d) Double-click MANUFACTURER to move this field from the Available list box to the
Selected list box.
e) Double-click CONTACT to move this field from the Available list box to the Selected list
box.
f) Double-click CONTACT_ADDRESS to move this field from the Available list box to the
Selected list box.
g) Double-click CONTACT_CITY to move this field from the Available list box to the
Selected list box.
h) Double-click CONTACT_STATE_PROV to move this field from the Available list box
to the Selected list box.
i) Double-click CONTACT_POSTAL_CD to move this field from the Available list box to
the Selected list box.
j) Double-click CONTACT_CNTRY to move this field from the Available list box to the
Selected list box.
k) Double-click CONTACT_PHONE to move this field from the Available list box to the
Selected list box.
7) Click OK to close the Match Report Properties window.
m. Save the job.
1) Click the Data Flow tab.
2) Select File  Save.
n. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions  Run Data Job.
3) Verify that the dfReport Viewer appears.
4) Select File  Exit to close the dfReport Viewer window.
Notice the processing information on each node.
o. View the log information.
1) Click the Log tab.
2) Review the information for each of the nodes.
p. Click the Data Flow tab.
q. Select File  Close.
5. Selecting the Best Record
a. Create a copy of the Ch4E4_Manufacturers_MatchReport job.
1) Verify that the Home tab is selected.
2) Click the Folders riser bar.
3) Click the expand icon in front of the Basics Exercises repository.
4) Click the expand icon in front of the batch_jobs folder.
5) Right-click the Ch4E4_Manufacturers_MatchReport job and select Copy.
6) Right-click the batch_jobs folder and select Paste.
A message window appears.
a) Click Rename the new item (Copy of …, Copy (2) of …, etc.).
b) Click OK.
7) Rename the copied job.
a) Right-click Copy of Ch4E4_Manufacturers_MatchReport and select Rename.
b) Type Ch4E5_Manufacturers_SelectBestRecord.
c) Press ENTER.
b. Double-click Ch4E5_Manufacturers_SelectBestRecord to open the job for editing.
c. Right-click the Match Report node (labeled Manufacturers Match Report) and select Delete.
d. Add a Surviving Record Identification node to the job flow.
1) In the Nodes resource pane, click the expand icon in front of the Entity Resolution grouping of nodes.
2) Double-click the Surviving Record Identification node.
The node appears in the data flow. The Surviving Record Identification Properties window appears.
e. Specify properties for the Surviving Record Identification node.
1) Type Select Best Record in the Name field.
2) Click the down arrow next to Cluster ID field and select Cluster_ID.
3) Click Options to the right of the Cluster ID field.
a) Click Keep duplicate records.
b) Type SR_ID in the Surviving record ID field field.
c) Click Use primary key as surviving record ID.
d) Click the down arrow next to Primary key field and select ID.
e) Click OK to close the Options window.
4) Add one record rule.
a) Click Add in the Record rules area.
b) Click the down arrow next to Field and select POSTDATE.
c) Click the down arrow in the Operation field and select Maximum Value.
d) Click Add Condition. The Expression area is updated.
e) Click OK to close the Add Record Rule Expression window.
5) Click Field Rules in the Output fields area, lower right.
a) Click Add in the Field Rules window.
b) Click Add in the Rule expressions area of the Add Field Rule window.
c) Click the down arrow next to Field and select CONTACT.
d) Click the down arrow in the Operation field and select Highest Occurrence.
e) Click Add Condition.
f) Click OK to close the Add Field Rule Expression window.
g) Click OK to close the Add Field Rule window.
6) Click OK to close the Surviving Record Identification Properties window.
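In effect, the record rule picks one surviving row per cluster and the field rule can override individual fields on that row. A sketch over one invented cluster of duplicates:

```python
from collections import Counter

# Invented cluster of duplicate records.
cluster = [
    {"ID": 7, "POSTDATE": "2011-02-01", "CONTACT": "J. Smith"},
    {"ID": 8, "POSTDATE": "2012-03-15", "CONTACT": "John Smith"},
    {"ID": 9, "POSTDATE": "2010-11-30", "CONTACT": "John Smith"},
]

# Record rule: keep the record with the maximum POSTDATE
# (ISO-formatted date strings compare correctly as text).
survivor = dict(max(cluster, key=lambda r: r["POSTDATE"]))

# Field rule: override CONTACT with its highest-occurrence value.
survivor["CONTACT"] = Counter(r["CONTACT"] for r in cluster).most_common(1)[0][0]

# Use the primary key as the surviving record ID.
survivor["SR_ID"] = survivor["ID"]
```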
f. Add a Text File Output node.
1) In the Nodes resource pane, click the collapse icon to collapse the Entity Resolution grouping of nodes.
2) Click the expand icon in front of the Data Outputs grouping of nodes.
3) Double-click the Text File Output node.
g. Edit the properties of the Text File Output node.
1) Type Write Results in the Name field.
2) Specify the output file information.
a) Click the ellipsis button next to the Output file field.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E5_Manufacturer_Best_Record.txt in the File name field.
d) Click Save.
3) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
Chapter 4 ACT
4) Select certain fields for the text file.
a) Click the double left arrow button to move all fields from the Selected list box.
b) Double-click each of the following fields to move them from the Available list box to the Selected list box:
Cluster_ID
ID
SR_ID
POSTDATE
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
5) Click OK to close the Text File Output Properties window.
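These Text File Output settings describe an ordinary quoted, comma-delimited file with a header row. Purely to illustrate the format the node produces — standard-library Python, not DataFlux code:

```python
import csv

def write_delimited(path, rows, fields):
    """Write the selected fields as comma-delimited text with a
    double-quote text qualifier and a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter=",",
                                quotechar='"', quoting=csv.QUOTE_ALL,
                                extrasaction="ignore")
        writer.writeheader()   # the "Include header row" option
        writer.writerows(rows)
```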
h. Save the job.
1) Click the Data Flow tab.
2) Select File  Save.
i. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions  Run Data Job.
The text file appears:
Two groups are highlighted in the above view of the text file.
For Cluster ID=4, the selected record is the record with ID=78. Why?
For Cluster_ID=6, the selected record is the record with ID=11. Why?
3) Select File  Exit to close the Notepad window.
j. Select File  Close.
6. Creating a Multi-Input/Multi-Output Data Job
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Click the Basics Exercises repository.
e. Click the New button and select Data Job.
1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
2) Type Ch4E6_MultiOutput in the Name field.
3) Click OK.
f. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the resource pane.
2) Click the expand icon in front of the Data Inputs grouping of nodes.
3) Double-click the Data Source node.
4) Type Manufacturers Table in the Name field.
5) Click the ellipsis button next to the Input table field.
a) Click the expand icon in front of the dfConglomerate Grocery data source.
b) Click MANUFACTURERS.
c) Click OK to close the Select Table window.
6) In the Output fields area, select only the following fields:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
7) Click OK to save the changes and close the Data Source Properties window.
g. Add a Standardization node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
2) Click the expand icon in front of the Quality grouping of nodes.
3) Double-click the Standardization node.
4) Type Standardize Fields in the Name field.
5) Double-click each of the following fields to move them from the Available list box to the Selected list box:
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_STATE_PROV
CONTACT_PHONE
6) For the MANUFACTURER field, click the drop-down arrow under Definition and select Organization.
7) For the CONTACT field, click the drop-down arrow under Definition and select Name.
8) For the CONTACT_ADDRESS field, click the drop-down arrow under Definition and select Address.
9) For the CONTACT_STATE_PROV field, click the drop-down arrow under Definition and select State/Province (Abbreviation).
10) For the CONTACT_PHONE field, click the drop-down arrow under Definition and select Phone.
11) Click Preserve null values.
12) Click OK to close the Standardization Properties window.
h. Add a Match Codes node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
2) Click the expand icon in front of the Entity Resolution grouping of nodes.
3) Double-click the Match Codes node.
4) Type Generate Match Codes in the Name field.
5) Double-click the MANUFACTURER_Stnd field to move it from the Available list box to the Selected list box.
6) Click the drop-down arrow under Definition and select Organization.
7) Accept the default Sensitivity of 75.
8) Double-click the CONTACT_Stnd field to move it from the Available list box to the Selected list box.
9) Click the drop-down arrow under Definition and select Name.
10) Accept the default Sensitivity of 75.
11) Double-click the CONTACT_ADDRESS_Stnd field to move it from the Available list box to the Selected list box.
12) Click the drop-down arrow under Definition and select Address.
13) Accept the default Sensitivity of 85.
14) Double-click the CONTACT_STATE_PROV_Stnd field to move it from the Available list box to the Selected list box.
15) Click the drop-down arrow under Definition and select State/Province.
16) Accept the default Sensitivity of 85.
17) Double-click the CONTACT_POSTAL_CD field to move it from the Available list box to the Selected list box.
18) Click the drop-down arrow under Definition and select Postal Code.
19) Accept the default Sensitivity of 85.
20) Click Generate null match codes for blank field values.
21) Click Preserve null values.
22) Click OK to close the Match Codes Properties window.
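Each match code reduces similar values to a shared fuzzy key, and the Sensitivity setting controls how aggressively variants collapse together (lower sensitivity keeps less detail, so more values match). DataFlux's match definitions are proprietary; the toy sketch below only illustrates the idea — uppercasing, stripping punctuation and noise words, then truncating harder at lower sensitivity — and is not how the product actually computes codes:

```python
import re

NOISE = {"INC", "CORP", "CO", "LLC", "THE"}  # illustrative noise-word list

def match_code(value, sensitivity=75):
    """Toy fuzzy key: normalize case and punctuation, drop noise
    words, then keep fewer characters as sensitivity decreases."""
    words = re.sub(r"[^A-Z0-9 ]", "", value.upper()).split()
    core = "".join(w for w in words if w not in NOISE)
    keep = max(2, len(core) * sensitivity // 100)
    return core[:keep]
```

Under this sketch, "Acme, Inc." and "ACME Corp" produce the same key at sensitivity 75, which is the effect the node relies on when joining or clustering.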
i. Select the layout button  Top To Bottom to change the layout to top to bottom.
j. Add the Text File Input node to the data job.
1) Verify that the Nodes riser bar is selected in the Resource pane.
2) Click the expand icon in front of the Data Inputs grouping of nodes.
3) Right-click the Text File Input node and select Insert Node.
4) Right-click the Text File Input node and select Properties.
5) Type Manufacturers Contact List in the Name field.
6) Click the ellipsis button next to the Input file field.
a) Navigate to S:\Workshop\dqdmp1\data\Text Files.
b) Click Manufacturer_Contact_List.txt.
c) Click Open to close the Open window.
7) Select “ (double quotation mark) as the value for the Text qualifier field.
8) Verify that Comma is the value for the Field delimiter field.
9) Click Number of rows to skip.
10) Verify that 1 (one) is the default value for number of rows to skip.
11) Click Import under the Fields area.
a) If necessary, navigate to S:\Workshop\dqdmp1\data\Text Files.
b) Click Manufacturer_Contact_List.dfl.
c) Click Open.
12) Click OK to save the changes and close the Text File Input Properties window.
k. Add a Standardization node to the job flow.
1) Click the expand icon in front of the Quality grouping of nodes.
2) Double-click the Standardization node. (Verify that the node was attached to the Text File Input node.)
3) Type Standardize Fields in the Name field.
4) Double-click each of the following fields to move them from the Available list box to the Selected list box:
COMPANY
NAME
WORK_ADDRESS
WORK_STATE
WORK_PHONE
5) For the COMPANY field, click the drop-down arrow under Definition and select Organization.
6) For the NAME field, click the drop-down arrow under Definition and select Name.
7) For the WORK_ADDRESS field, click the drop-down arrow under Definition and select Address.
8) For the WORK_STATE field, click the drop-down arrow under Definition and select State/Province (Abbreviation).
9) For the WORK_PHONE field, click the drop-down arrow under Definition and select Phone.
10) Click Preserve null values.
11) Click OK to close the Standardization Properties window.
l. Add a Match Codes node to the job flow.
1) Click the expand icon in front of the Entity Resolution grouping of nodes.
2) Double-click the Match Codes node.
3) Type Generate Match Codes in the Name field.
4) Double-click the COMPANY_Stnd field to move it from the Available list box to the Selected list box.
5) Click the drop-down arrow under Definition and select Organization.
6) Accept the default Sensitivity of 75.
7) Double-click the NAME_Stnd field to move it from the Available list box to the Selected list box.
8) Click the drop-down arrow under Definition and select Name.
9) Accept the default Sensitivity of 75.
10) Double-click the WORK_ADDRESS_Stnd field to move it from the Available list box to the Selected list box.
11) Click the drop-down arrow under Definition and select Address.
12) Accept the default Sensitivity of 85.
13) Double-click the WORK_STATE_Stnd field to move it from the Available list box to the Selected list box.
14) Click the drop-down arrow under Definition and select State/Province.
15) Accept the default Sensitivity of 85.
16) Double-click the WORK_ZIP field to move it from the Available list box to the Selected list box.
17) Click the drop-down arrow under Definition and select Postal Code.
18) Accept the default Sensitivity of 85.
19) Click Generate null match codes for blank field values.
20) Click Preserve null values.
21) Click OK to close the Match Codes Properties window.
m. Add a Sequencer (Autonumber) node to the job flow.
1) Click the expand icon in front of the Utilities grouping of nodes.
2) Double-click the Sequencer (Autonumber) node.
3) Type Create Unique ID in the Name field.
4) Type ID in the Field name field.
5) Accept the default of 1 for the Start Number and Interval fields.
6) Click OK to close the Sequencer (Autonumber) Properties window.
n. Add the Data Joining node to the data job.
1) Verify that the Nodes riser bar is selected in the Resource pane.
2) Click the expand icon in front of the Data Integration grouping of nodes.
3) Right-click the Data Joining node and select Insert Node.
a) On the job flow diagram, drag the node below and between the two side-by-side branches.
b) Right-click the Data Joining node in the job flow and select Node Connections.
c) Click the add button in the Connect from area.
d) Click the Match Codes node and then click OK.
e) Click the add button in the Connect from area.
f) Click the Sequencer node and then click OK.
g) Click OK to save the changes and close the Data Connections window.
4) Right-click the Data Joining node and select Properties.
5) Type Join DB Table and Text File in the Name field.
6) Click Right as the value for the Join type field.
7) Click Right table as the value for the Memory load option field.
 The performance of the data job accelerates if enough memory is available to load one of the sides into memory.
The data job flow should now resemble the following:
8) Specify the following three sets of join conditions:
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_ADDRESS_Stnd_MatchCode = WORK_ADDRESS_Stnd_MatchCode
CONTACT_STATE_PROV_Stnd_MatchCode = WORK_STATE_Stnd_MatchCode
or
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_ADDRESS_Stnd_MatchCode = WORK_ADDRESS_Stnd_MatchCode
CONTACT_POSTAL_CD_MatchCode = WORK_ZIP_MatchCode
or
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_PHONE_Stnd = WORK_PHONE_Stnd
CONTACT_POSTAL_CD_MatchCode = WORK_ZIP_MatchCode
9) Remove the _MatchCode fields from the Selected list box under the Output fields area.
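With a right join and the three alternative condition sets OR'ed together, every text-file record is kept, and a database record is attached whenever any one set of match-code equalities holds. A simplified sketch of that logic — plain Python over dictionaries, not DataFlux code, and deliberately ignoring details such as multiple candidate matches:

```python
def fuzzy_right_join(db_rows, text_rows, condition_sets):
    """Right join: keep every text-file row; attach db fields
    (suffixed _1) when any condition set matches.  Each condition
    set is a list of (db_field, text_field) equality pairs."""
    joined = []
    for t in text_rows:
        match = next((d for d in db_rows
                      if any(all(d.get(l) == t.get(r) for l, r in cond)
                             for cond in condition_sets)), None)
        row = {f"{k}_2": v for k, v in t.items()}          # text side
        row.update({f"{k}_1": v for k, v in (match or {}).items()})
        joined.append(row)
    return joined
```

The `_1`/`_2` suffixing mirrors how the joined output distinguishes same-named fields from the two inputs.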
o. Add a Branch node to the data job flow.
1) Click the expand icon in front of the Utilities grouping of nodes.
2) Double-click the Branch node.
3) Type Branch in the Name field.
4) Click OK to close the Branch Properties window.
p. Add a Data Validation node to the data job flow.
1) Click the expand icon in front of the Utilities grouping of nodes.
2) Double-click the Data Validation node.
3) Type Matches in the Name field.
4) Add the expression ID_1 is not null to the Expression area.
5) Click OK to close the Data Validation Properties window.
q. Add a second Data Validation node to the data job flow.
1) Click the expand icon in front of the Utilities grouping of nodes.
2) Right-click the Data Validation node and select Insert Node.
3) Drag the newest Data Validation node below and to the right of the Branch node.
4) Click the Branch node and drag it to the Data Validation node. A connection is established.
5) Right-click the Data Validation node and select Properties.
6) Type Non-Matches in the Name field.
7) Add the expression ID_1 is null to the Expression area.
8) Click OK to close the Data Validation Properties window.
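The Branch node plus the two Data Validation expressions (ID_1 is not null versus ID_1 is null) simply split the joined stream in two: rows that found a database match, and rows that did not. An equivalent sketch in plain Python (illustrative only):

```python
def branch_on_match(rows, key="ID_1"):
    """Split joined rows into matches (key present and non-null)
    and non-matches (key missing or null)."""
    matches = [r for r in rows if r.get(key) is not None]
    non_matches = [r for r in rows if r.get(key) is None]
    return matches, non_matches
```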
r. Add a Text File Output node for the matches.
1) Click the Matches data validation node.
2) Click the expand icon in front of the Data Outputs grouping of nodes.
3) Double-click the Text File Output node.
4) Type Manufacturers - Matches in the Name field.
5) Specify the output file information.
a) Click the ellipsis button next to the Output file field.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturers_Matches.txt in the File name field.
d) Click Save.
6) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
7) Specify desired fields for the output file.
a) Click the double left arrow button to move all default selected fields from the list.
b) Specify the following fields with the specified output name:
Field Name            Output Name
ID_1                  DB_ID
MANUFACTURER_1        MANUFACTURER
CONTACT_1             CONTACT
ID_2                  TEXT_ID
COMPANY_2             COMPANY
NAME_2                NAME
8) Export the field layout.
a) Click Export.
b) If necessary, navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturer_Matches.dfl in the File name field.
d) Click Save.
9) Click OK to save the changes and close the Text File Output Properties window.
s. Add a Text File Output node for the non-matches.
1) Click the Non-Matches data validation node.
2) Click the expand icon in front of the Data Outputs grouping of nodes.
3) Double-click the Text File Output node.
4) Type Manufacturers – Non-Matches in the Name field.
5) Specify the output file information.
a) Click the ellipsis button next to the Output file field.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturers_NonMatches.txt in the File name field.
d) Click Save.
6) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
7) Specify desired fields for the output file.
a) Click the double left arrow button to move all default selected fields from the list.
b) Specify the following fields with the specified output name:
Field Name            Output Name
COMPANY_2             COMPANY
NAME_2                NAME
WORK_ADDRESS_2        WORK_ADDRESS
WORK_CITY_2           WORK_CITY
WORK_STATE_2          WORK_STATE
WORK_ZIP_2            WORK_ZIP
WORK_PHONE_2          WORK_PHONE
8) Export the field layout.
a) Click Export.
b) If necessary, navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturer_NonMatches.dfl in the File name field.
d) Click Save.
9) Click OK to save the changes and close the Text File Output Properties window.
The final job flow should resemble the following:
t. Save the job.
1) Click the Data Flow tab.
2) Select File  Save.
u. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions  Run Data Job.
The text files appear.
v. Select File  Exit to close each of the Notepad windows.
Chapter 5 MONITOR
5.1 Business Rules ............................................................................................................... 5-3
Demonstration: Creating a Row-Based Business Rule ......................................................... 5-6
Exercises .............................................................................................................................. 5-11
Demonstration: Adding a Business Rule and an Alert to a Profile ....................................... 5-15
Demonstration: Performing a Historical Visualization .......................................................... 5-21
Exercises .............................................................................................................................. 5-25
Demonstration: Creating a Monitoring Job for a Row-Based Rule ...................................... 5-29
Demonstration: Creating Another Data Job Calling the Same Task .................................... 5-38
Exercises .............................................................................................................................. 5-41
Demonstration: Creating a Monitoring Job for a Set-Based Rule ........................................ 5-42
Exercises .............................................................................................................................. 5-48
Demonstration: Viewing the Monitor Dashboard ................................................................. 5-50
5.1 Business Rules
Objectives
 Define a business rule.
 Create a row-based business rule.
What Is a Business Rule?
A business rule is a formula, validation, or comparison that can be applied to a given set of data. Data must either pass or fail the business rule.
Business Rule Examples
 Validate whether product codes adhere to a particular pattern.
 Test domain values against a list of acceptable values.
– Gender is Male, Female, or Unknown.
– Region is East, Central, or West.
 Detect unexpected variances.
– The Product Description field contains blanks or nulls.
– More than 5% of mailing addresses are not verifiable.
– The total sales amount for Q1 is less than the expected target amount.
Business rules can be used in both data profiles and data jobs to monitor your data.
Creating a Business Rule
To create a business rule, do the following:
1) Create Fields.
The field names are placeholders. When the business rule is applied to the data, the input data field names are mapped to these placeholder names.
2) Create Rule.
These are the criteria that the input data is checked against.
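The placeholder mechanism can be pictured as a predicate written against abstract field names, with real column names bound only when the rule is applied. The helper below is hypothetical — a plain-Python analogy, not the DataFlux API:

```python
def apply_rule(rule, mapping, row):
    """Evaluate a rule written against placeholder field names.
    `mapping` binds each placeholder (e.g. PHONE) to a real
    column (e.g. BUSINESS PHONE) at apply time."""
    bound = {placeholder: row.get(column)
             for placeholder, column in mapping.items()}
    return rule(bound)  # True = pass, False = fail

# Example rule: fail when both EMAIL and PHONE are missing
email_and_phone = lambda f: not (f["EMAIL"] is None and f["PHONE"] is None)
```

Because the rule only mentions placeholders, the same rule can be reused against any table whose columns are mapped to EMAIL and PHONE.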
Types of Rules
Row-based rule
Evaluates every row of data passed into the monitoring node.
Set-based rule
Evaluates and applies rules to all of the input data in totality (for example, evaluates 1000 rows as a set).
Group-based rule
Evaluates and applies all rules to groups of data (for example, if data is grouped by product code, then the rules are evaluated for each product code).
When building a business rule, you can do the following:
• Take advantage of the Expression Builder, a wizard-like interface that helps you build business rules from a predefined list of data checks.
• Specify fields that are not being evaluated and enable the passing of data through the monitoring node so that the fields are available in the tasks and can be output into audit files.
• Specify the expression, or the DataFlux syntax, of the business rules that were built with the Expression Builder.
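The three rule types differ only in what the predicate sees: a single row, the entire input, or one group at a time. A compact illustration in plain Python (not DataFlux syntax):

```python
from itertools import groupby

def evaluate_row_rule(rows, predicate):
    """Row-based: each row passes or fails on its own."""
    return [predicate(r) for r in rows]

def evaluate_set_rule(rows, predicate):
    """Set-based: one verdict for the entire input."""
    return predicate(rows)

def evaluate_group_rule(rows, key, predicate):
    """Group-based: one verdict per group of rows sharing a key."""
    ordered = sorted(rows, key=key)
    return {k: predicate(list(g)) for k, g in groupby(ordered, key=key)}
```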
Creating a Row-Based Business Rule
In this demonstration, you create a row-based business rule to monitor when an e-mail address and a
phone number are missing from a record.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools  Business Rules Manager  Basics Demos.
A tab appears and the Business Rules Manager appears in the Basics Demos repository.
4. Create three needed fields.
a. Right-click the Fields folder and select New Field.
1) Type PK in the Name field.
2) Type Primary Key in the Description field.
3) Click OK.
5.1 Business Rules
b. Right-click the Fields folder and select New Field.
1) Type EMAIL in the Name field.
2) Type Email Address in the Description field.
3) Click OK.
c. Right-click the Fields folder and select New Field.
1) Type PHONE in the Name field.
2) Type Phone Number in the Description field.
3) Click OK.
The final set of fields should resemble the following:
5-7
5-8
Chapter 5 MONITOR
5. Create a new row-based rule.
a. Right-click the Rules folder and select New Rule.
b. Type Email and Phone Missing in the Name field.
c. Verify that Row is selected in the Type area.
d. Under the Definition area, click Builder on the Checks tab.
The Expression Builder window appears.
e. Click Get information about the contents of a field (is null, is numeric, etc.) under Get Field
Information in the Step 1 area.
f. Click the right arrow button to move the check to the Step 2 area.
5.1 Business Rules
5-9
g. Click the Field is null link.
The Get Field Information window appears.
1) Click the ellipsis button next to the Field field. The Select Field window appears.
2) Click the EMAIL field.
3) Click OK to close the Select Field window.
h. Click the drop-down arrow in the Check field and select is null.
i. Click OK to close the Get Field Information window.
j. Click Get information about the contents of a field (is null, is numeric, etc.) under Get Field Information in the Step 1 area.
k. Click the right arrow button to move the check to the Step 2 area.
l. Click the Field is null link.
The Get Field Information window appears.
1) Click the ellipsis button next to the Field field. The Select Field window appears.
2) Click the PHONE field.
3) Click OK to close the Select Field window.
m. Click the drop-down arrow in the Check field and select is null.
n. Click OK to close the Get Field Information window.
o. Click OK to close the Expression Builder window.
p. Click the Fields tab.
q. Double-click the PK field to move it from the Available list to the Selected list.
r. Click OK to close the New Rule window.
The new rule appears in the Rules folder.
6. Select File  Close to close the Business Rules Manager tab.
7. If necessary, click the Home tab.
8. If necessary, click the Folders riser bar.
9. Click the Basics Demos repository.
10. Select View  Refresh.
A Shared Data folder appears.
11. Expand the Shared Data folder.
The new rule appears in the Shared Data folder.
Exercises
1. Creating a Row-Based Rule
Create a row-based rule that performs two checks:
Is a field numeric?
OR
Is the length of the field equal to 14?
Some specifics follow:
• Use the Basics Exercises repository.
• Create two fields – PK (Primary Key) and UPC (Universal Product Code).
• Use the check Get information about the contents of a field to specify the check UPC is not
numeric.
• Use the check Compare the length of a field to a value to specify the check Length of UPC is
not equal to 14.
• When used, either of the checks is valid (add an OR between the two checks).
• Select the PK field in addition to the UPC field.
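Before assembling the checks in the Expression Builder, the exercise's rule can be prototyped as a simple predicate — flag a UPC that is not numeric OR whose length is not 14. This is an illustrative Python sketch (it assumes the UPC arrives as text), not the DataFlux expression syntax:

```python
def upc_violates(upc):
    """True when the UPC fails either check: value is not all
    digits, or its length is not equal to 14 (OR of the checks)."""
    value = "" if upc is None else str(upc)
    return (not value.isdigit()) or len(value) != 14
```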
5.2 Data Profiling
Objectives
 Discuss how to define and explore business rules in data profiles.
 Discuss how to define and explore alerts in data profiles.
 Discuss how to define historical visualizations in a data profile report.
Using a Business Rule in a Data Profile
On the Properties tab for a profile, select the Business Rules sub-tab to apply business rule(s) to the data.
Surfacing a Business Rule in Profile Reports
After the profile is run, on the Report tab, select the Business Rules sub-tab to view the number of violations.
Alerts
On the Properties tab for a profile, select the Alerts sub-tab to set up alert(s) for the data. After the profile is run, on the Report tab, select the Alerts sub-tab to determine whether any alerts were triggered.
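An alert pairs a profiling metric with a comparison and a threshold. The Pattern Count metric used in the next demonstration counts distinct character patterns in a field; a common pattern-analysis convention maps digits to 9 and letters to A, though DataFlux's exact pattern logic may differ. A hedged sketch of the alert check:

```python
def pattern_of(value):
    """Map a value to a character pattern: 9 for digits, A for
    letters; other characters pass through unchanged."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in str(value))

def alert_triggered(values, threshold=3):
    """Alert condition: Pattern Count metric is greater than threshold."""
    return len({pattern_of(v) for v in values}) > threshold
```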
Historical Visualizations
Historical visualizations are customized charts that show how metrics might change over a period of time.
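Behind such a chart is nothing more than a time-stamped series of metric values, one per profile run, for a chosen field and metric. Assuming run results stored as simple dictionaries (an illustrative data shape, not the repository's actual format):

```python
def metric_history(runs, field, metric):
    """Extract one metric's time series across profile runs,
    ready to plot as a historical line."""
    return [(run["timestamp"], run["metrics"][field][metric])
            for run in runs if field in run["metrics"]]
```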
Adding a Business Rule and an Alert to a Profile
In this demonstration, you examine the steps necessary to add a business rule that was created previously.
In addition, you demonstrate how to set up an alert and how to locate and view the alerts in the profile
report.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Expand the Basics Demos repository.
5. Expand the profiles_and_explorations folder.
6. Double-click the Ch3D3_dfConglomerate_Profile profile.
7. Click the Properties tab.
8. Expand dfConglomerate Gifts and click Customers.
9. Click the Business Rules sub-tab.
10. Click Add.
The Business Rule Properties window appears.
a. Click the drop-down arrow in the Business rule field and select Email and Phone Missing.
b. Click the drop-down arrow in the Field Name field for PHONE and select BUSINESS PHONE.
c. Click the drop-down arrow in the Field Name field for PK and select ID.
d. Click the Log data that violates the business rule check box.
e. Click OK.
The Business Rules tab displays the following:
5.1 Business Rules
11. Click Products (in dfConglomerate Gifts).
12. Click the Alerts sub-tab.
13. Click Add.
The Add Alert window appears.
a. Click Standard Metric.
b. Click Continue.
The Add Standard Metric Alert window appears.
c. Click the drop-down arrow in the Field name field and select PRODUCT CODE.
d. Click the drop-down arrow in the Metric field and select Pattern Count.
e. Click the drop-down arrow in the Comparison field under Alert condition and select Metric is greater than.
f. Type 3 (three) in the Value field.
g. Type Product Code Pattern Count is greater than 3 in the Description field.
 The Send e-mail option requires that the EMAIL SMTP Server is configured in the app.cfg file.
h. Click OK.
The Alerts tab displays the following:
14. Select File  Save Profile to save the changed profile.
15. Select Actions  Run Profile Report.
a. Type Additional profile run in the Description field.
b. Verify that Append to existing report is selected.
c. Click OK to close the Run Profile window.
The profile executes. A status of the execution is displayed.
The Report tab becomes active.
16. Investigate the triggered alert.
a. Click the expand icon in front of dfConglomerate Gifts. Expand Products  PRODUCT CODE. Verify that a warning symbol appears with each selection.
b. Click the Products table.
c. Click the Alerts sub-tab.
The Alert was triggered and is written on the sub-tab. If the Alert was not triggered, there would
be no indicators on the Reports tab and nothing written on the Alerts sub-tab for the table.
QUESTION: What is the Pattern Count metric for the PRODUCT CODE field?
17. Investigate the applied business rule.
a. Click the Customers table.
b. Click the Business Rules sub-tab.
The actual records where the violations occurred are displayed.
c. Double-click the summary of the violations. The Business Rule Violation Drill Through window
appears.
 This drill-through information is available only because Log data that violates the business rule was selected.
d. Click Close to close the Business Rule Violation Drill Through window.
e. Click the Employees table.
f. If necessary, click the Business Rules sub-tab.
There should not be any violations for this table. The business rule was established only for the Customers table.
18. Select File  Close Profile.
Performing a Historical Visualization
In this demonstration, you run a data job that updates the Customers table with standardized phone fields.
A profile on the Customers table is rerun and historical metrics are examined visually.
1. Access a data job in the Basics Solutions repository.
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Expand the Basics Solutions repository.
e. Expand the batch_jobs folder.
f. Double-click the Ch5D3_Update_Customers_Table data job.
2. Review the data job.
a. Right-click the Data Source node and select Properties.
1) Verify that the ID field and the four phone number fields are selected from the Customers table.
2) Click OK to close the Data Source Properties window.
b. Right-click the Standardization node and select Properties.
1) Verify that the Phone standardization definition is applied to each of the four phone number fields.
2) Click OK to close the Standardization Properties window.
c. Right-click the Data Target (Update) node and select Properties.
1) Verify that the ID Input Field is mapped to the ID Table Field.
2) Verify that the standardized phone fields (those ending with the default suffix _Stnd) replace their corresponding phone fields.
3) Click OK to close the Data Target (Update) Properties window.
3. Run and then close the data job.
a. Verify that the Data Job tab is selected.
b. Select Actions  Run Data Job.
c. Verify that the job runs with no errors.
d. Select File  Close.
4. Access and run a profile that involves the Customers table.
a. Verify that the Home tab is selected.
b. Click the Folders riser bar.
c. Expand the Basics Demos repository.
d. Expand the profiles_and_explorations folder.
e. Double-click the Ch3D3_dfConglomerate_Profile profile.
f. Select Actions  Run Profile Report.
g. Type Additional processing after phone standardizations in the Description field.
h. Verify that Append to existing report is selected.
i. Click OK to close the Run Profile window.
The profile executes. A status of the execution is displayed.
The Report tab becomes active.
5. Review the generated information for the Customers table.
a. Verify that the Report tab is active.
b. Click the expand icon in front of dfConglomerate Gifts.
c. Click the Customers table.
d. Click the Visualizations sub-tab.
e. Add a new graph.
1) Click the add chart button.
2) Type Historical Line for Phone Metrics in the Description field.
3) Click the drop-down arrow in the Chart type field and select Historical Line.
4) Click BUSINESS PHONE under the Fields list.
5) Click Unique Count, Null Count, Blank Count, and Pattern Count under the Metrics list.
6) Select the first date-time value as the value for the Start date field.
7) Select the last date-time value as the value for the End date field.
8) Click OK to close the Chart Properties window.
The Visualizations tab surfaces the newly specified historic line plot.
 Your view might differ depending on the number of times that the above profile was run.
The Pattern Count and Unique Count metrics decreased for this last execution.
6. Select File  Close Profile.
Exercises
2. Adding a Business Rule and an Alert to a Profile
Access the profile named Ch3E3_dfConglomerateGrocery_Profile (in the Basics Exercises
repository).
Update the profile as follows:
• Add the UPC Check business rule for the BREAKFAST_ITEMS table. Map PK to ID and be sure
to log the data that violates the business rules.
• On the UPC Check business rule, add an alert that looks for the percentage of failed rows greater
than 0 (zero).
• Save and rerun the profile.
QUESTION: Were there violations to the business rule? If so, what records did not pass the rule
check?
Answer:
QUESTION: Was the alert triggered?
Answer:
5.3 Data Monitoring Jobs
Objectives
 Discuss how to define a monitoring task.
 Discuss and use several of the available events.
 Create a monitoring data job.
 View the Monitoring report.
Monitoring Tasks
Monitoring tasks are created by pairing a defined business rule with one or more events. Some available events include the following:
• Call a real-time service
• Execute a program
• Log error to repository
• Log error to text file
• Run a local job
• Run a local profile
• Send e-mail message
• Set a data flow key or value
• Write a row to a table
When creating a Monitoring task, select the business rule(s) that are part of the task. For each business rule, select the event(s) that are triggered if there is a rule violation.
Note: If the Send e-mail message event is selected, a separate e-mail is sent for each rule violation.
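The rule-plus-events pairing can be sketched in code. This is a conceptual illustration only, not DataFlux syntax; the function names, row dictionaries, and event list below are all hypothetical:

```python
def email_and_phone_missing(row):
    """Rule check: a row violates when both EMAIL and PHONE are empty."""
    return not row.get("EMAIL") and not row.get("PHONE")

def run_task(rows, rule, events):
    """Fire every event once for each violating row; return the count."""
    triggered = 0
    for row in rows:
        if rule(row):
            triggered += 1
            for event in events:
                event(row)
    return triggered

repo_log = []                                   # stands in for "Log error to repository"
rows = [
    {"PK": 1, "EMAIL": "a@b.com", "PHONE": ""},
    {"PK": 2, "EMAIL": "", "PHONE": ""},        # violates the rule
]
count = run_task(rows, email_and_phone_missing, [repo_log.append])
```

Because each event fires once per violating row, attaching an e-mail event here would likewise send one message per violation, as the note above warns.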
Business Rules in Data Jobs
In the Data Monitoring node, map the fields used in the task to the input fields in the data job. The Export function from the Business Rules Manager suggests a mapping based on field names. If a field is not mapped or is mapped incorrectly, use the drop-down menu to map the task field to the appropriate field name.
A Monitoring task can be applied to one or more data sources, including modified data in a data job. After the job is run, the log lists the number of rows read and the number of events triggered by the Data Monitoring step.
Note: Previewing the Data Monitoring node does not cause an event to be triggered. An event is triggered only when the job is run.
Monitoring Report
The Monitoring report provides a summary listing of monitoring task executions for a selected repository.
Note: Details of violations are listed only if the Log error to repository event was selected for the rule in the monitoring task.
The listing of items can be sorted or grouped to enable easier viewing of the results of the Monitoring task
executions. The data can also be exported as a CSV or XML file.
Creating a Monitoring Job for a Row-Based Rule
In this demonstration, you create a task that combines a previously created rule with two events. This task
is exported as a data job, and the data job is executed. The results are reviewed.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools  Business Rules Manager  Basics Demos.
A new tab appears with the Business Rules Manager for the Basics Demos repository.
4. Right-click the Tasks folder and select New Task.
a. Type Email and Phone Missing in the Name field.
b. Double-click the Email and Phone Missing rule to move it from the Available list to the
Selected list.
c. With the rule highlighted in the Selected list, click Rule Details.
The Rule Details window appears.
1) Click Add in the Events window.
a) Click the down arrow in the Select event field and select Log error to repository.
b) Click Continue.
c) Click the double-arrow button to move the three fields (EMAIL, PHONE, PK) from Available fields to Selected fields.
d) Click OK to close the Log Error To Repository Event Properties window.
2) Click Add in the Rule Details window.
a) Click the down arrow in the Select event field and select Log error to text file.
b) Click Continue.
c) Click the browse button next to the File field.
(1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
(2) Type Ch5D4_EmailAndPhoneMissing.txt in the File name field.
(3) Click Save.
d) Type PK where Email and Phone are Both Missing in the Header field.
e) Type %%PK%% in the Message field.
f) Type --End of File-- in the Footer field.
The final settings for the Log Error To Text File Event Properties
window should resemble the following:
g) Click OK to close the Log Error To Text File Event Properties window.
3) Click OK to close the Rule Details window.
d. Click OK to close the New Task window.
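The text-file event configured above writes the header once, then one message line per violating row with each %%FIELD%% token (here %%PK%%) replaced by that row's value, then the footer. A sketch of that behavior, with hypothetical sample rows and helper names:

```python
import re

def render_message(template, row):
    """Replace %%FIELD%% tokens in the template with the row's values."""
    return re.sub(r"%%(\w+)%%", lambda m: str(row[m.group(1)]), template)

def write_violation_file(path, violations, header, message, footer):
    """Header, one rendered message per violation, then footer."""
    with open(path, "w") as f:
        f.write(header + "\n")
        for row in violations:
            f.write(render_message(message, row) + "\n")
        f.write(footer + "\n")

violations = [{"PK": 101}, {"PK": 205}]         # hypothetical sample rows
write_violation_file("Ch5D4_EmailAndPhoneMissing.txt", violations,
                     "PK where Email and Phone are Both Missing",
                     "%%PK%%", "--End of File--")
```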
5. Right-click the Sources folder and select New Source.
a. Type Customers in the Name field.
b. Type Customers Table in the Description field.
c. Click OK to close the New Source window.
6. Right-click the Sources folder and select New Source.
a. Type Employees in the Name field.
b. Type Employees Table as the value for the Description field.
c. Click OK to close the New Source window.
The two sources should appear as follows:
7. If necessary, expand the Tasks folder.
8. Right-click the Email and Phone Missing task and select Export To Data Job.
a. Verify that Data Source is selected.
b. Click Next.
c. Click the browse button next to the Input table field.
1) Expand the dfConglomerate Gifts data source.
2) Click the Customers table.
3) Click OK to close the Select Table window.
d. Move all fields from the Selected list to the Available list.
e. Double-click ID to move it from the Available list to the Selected list.
f. Double-click EMAIL to move it from the Available list to the Selected list.
g. Double-click BUSINESS PHONE to move it from the Available list to the Selected list.
h. Click Next.
i. Click the down arrow in the Data Source Field field for the PHONE task field and select BUSINESS PHONE.
j. Click the down arrow in the Data Source Field field for the PK task field and select ID.
k. Click Next.
l. Type Both Email And Phone Missing in the Description field.
m. Click the browse button next to the Job name field.
1) Type Ch5D4_BothEmailAndPhoneMissing in the Name field.
2) If necessary, select Basics Demos as the value for the Save in field.
3) Double-click the batch_jobs folder.
4) Click Save.
n. Click the down arrow in the Source field and select Customers.
o. Click Finish.
Note: If the folders for Basics Demos do not appear when saving the above data job, do the following:
• Click the Home tab.
• Click the Administration riser bar.
• Click the expand icon to expand Repository Definitions.
• Right-click Basics Demos and select Update lineage.
• Click Close to close the Updating Repository Lineage window.
9. Access the new data job.
a. Click the Home tab.
b. Click the Folders riser bar.
c. Expand Basics Demos  batch_jobs.
d. Double-click the Ch5D4_BothEmailAndPhoneMissing data job.
e. If necessary, click the icon to display the data flow.
The data job should resemble the following:
10. Specify updated attributes for each of the nodes.
a. If necessary, click the Data Source node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Customers Table as the value for the Name field.
e. Type Data Source as the value for the Description field.
f. Click the Data Monitoring node in the data flow diagram.
g. Type Email/Phone Missing as the value for the Name field.
h. Type Data Monitoring as the value for the Description field.
The data flow should resemble the following:
11. Select File  Save.
12. Select Actions  Run Data Job.
13. When the job completes, click the Log tab.
A message was written to the log for each violation that was encountered.
14. Access Windows Explorer and examine the text file that was created.
a. Select Start  All Programs  Accessories  Windows Explorer to access Windows
Explorer.
b. Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
c. Double-click Ch5D4_EmailAndPhoneMissing.txt.
A Notepad window appears with the logged rule violations.
d. Select File  Exit to close the Notepad window.
15. From the data job, select Tools  Monitoring Report  Basics Demos.
16. Click the Trigger Values tab.
The specific violations are listed only because you selected the Log error to repository event. The summary information is written regardless, but the specifics are listed only if there are violations and this event is selected.
17. Click the History Graph tab.
18. Click the bar chart tool.
19. Select File  Close to close the tab for the Monitoring Report.
20. Select File  Close to close the tab for the data job.
21. Select File  Close to close the Business Rules Manager tab.
Creating Another Data Job Calling the Same Task
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the expand icon to expand the Basics Demos repository.
5. Click the expand icon to expand the batch_jobs folder.
6. Right-click the Ch5D4_BothEmailAndPhoneMissing data job and select Copy to Folder.
a. If necessary, navigate to Basics Demos and select the batch_jobs folder in the Copy To Folder
window.
b. Click Open.
c. Click Rename the new item.
d. Click OK.
7. Right-click the new data job (Copy of Ch5D4_BothEmailAndPhoneMissing) and select Rename.
a. Type Ch5D5_BothEmailAndPhoneMissing for the new name.
b. Press ENTER.
8. Double-click the new data job (Ch5D5_BothEmailAndPhoneMissing). The data job opens on a new
tab.
9. Change the data source of the data job.
a. Double-click the Data Source node (or right-click and select Properties).
b. Type Employees Table as the value for the Name field.
c. Click the browse button next to the Input table field.
1) If necessary, expand the dfConglomerate Gifts data source.
2) Click the Employees table.
3) Click OK to close the Select Table window.
d. Click the double-arrow button to remove all selected fields.
e. Double-click ID to move it from the Available list to the Selected list.
f. Double-click EMAIL to move it from the Available list to the Selected list.
g. Double-click BUSINESS PHONE to move it from the Available list to the Selected list.
h. Click OK to close the Data Source Properties window.
10. Edit the Data Monitoring node.
a. Double-click the Data Monitoring node (or right-click and select Properties).
b. Click the down arrow in the Source field and select Employees.
c. Click OK to close the Data Monitoring Properties window.
11. Select File  Save.
12. Select Actions  Run Data Job.
13. When the job completes, click the Log tab.
There were no violations this time. Therefore, the text file is not created.
14. From the data job, select Tools  Monitoring Report  Basics Demos.
15. Verify that no triggers occurred.
16. Select File  Close to close the tab for the Monitor Viewer.
17. Select File  Close to close the tab for the data job.
Exercises
3. Creating a Monitoring Job for a Row-Based Rule
• Create a task in the Basics Exercises repository:
- Use the UPC Check rule.
- Add a Log error to repository event.
• Create a source in the Basics Exercises repository (use a name of Breakfast Items with a description of Breakfast Items Table).
• Export the new task as a data job with the following specifics:
- Use the BREAKFAST_ITEMS table in the dfConglomerate Grocery data connection.
- Pass just the ID and UPC fields to the Data Monitoring node.
- Map the ID data source field to the PK task field.
- Specify a data job name of Ch5E3_UPCCheck with a description of Check UPC Field.
• Run the job, and review the Log.
QUESTION: How many records were monitored?
Answer:
QUESTION: How many records triggered the monitoring event?
Answer:
• View the records (if there are any) that triggered the monitoring event.
Creating a Monitoring Job for a Set-Based Rule
In this demonstration, you create a new field for use in a new set-based rule. The rule is combined with an event to form a task, and the new task is used in a data job.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools  Business Rules Manager  Basics Demos.
4. Create a new field by right-clicking the Fields folder and selecting New Field.
a. Type COUNTRY in the Name field.
b. Type Country Value in the Description field.
c. Click OK.
5. Create a new set-based rule.
a. Right-click the Rules folder and select New Rule.
b. Type Country Pattern Count Check in the Name field.
c. Click Set in the Type area.
d. Under Definition, click Builder on the Checks tab.
The Expression Builder window appears.
e. Click Compare a field to a value or to another field under Compare Entire Field in the Step 1
area.
f. Click the arrow button to move the check to the Step 2 area.
g. Click the Field is equal to Value link.
The Get Field Information window appears.
1) Click the browse button next to the Field field. The Select Field window appears.
2) Click the expand icon to expand the COUNTRY field.
3) Click the expand icon to expand the Standard folder.
4) Click PATCOUNT.
5) Click OK to close the Select Field window.
h. Click the down arrow in the Check field and select greater than.
i. Type 1 (one) in the Value field.
Note: SM represents Standard Metric and HSM represents Historical Standard Metric.
j. Click OK to close the Compare To Value Or Field window.
k. Click OK to close the Expression Builder window.
l. Click OK to close the New Rule window.
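The rule built above compares the PATCOUNT standard metric of COUNTRY (the number of distinct character patterns across the column's values) against 1. A minimal sketch of the idea; the pattern mapping below (letter to "A", digit to "9") is an illustrative assumption, not DataFlux's exact definition:

```python
def pattern(value):
    """Character pattern of a value: letters -> 'A', digits -> '9',
    other characters unchanged (an illustrative assumption)."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c
                   for c in value)

def pat_count(values):
    """PATCOUNT: number of distinct patterns across non-blank values."""
    return len({pattern(v) for v in values if v})

countries = ["USA", "GBR", "FRA", "U.S."]       # hypothetical column values
violation = pat_count(countries) > 1            # the set rule: PATCOUNT > 1
```

With a fully standardized column, every value shares one pattern (here "AAA"), so PATCOUNT is 1 and the set rule does not trigger; the stray "U.S." introduces a second pattern and causes a violation.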
6. Right-click the Tasks folder and select New Task.
a. Type Country Pattern Count Check in the Name field.
b. Double-click the Country Pattern Count Check rule to move it from the Available list to the
Selected list.
c. With the rule highlighted in the Selected list, click Rule Details.
The Events window appears.
1) Click Add in the Events window.
a) Click the down arrow in the Select event field and select Log error to repository.
b) Click Continue.
c) Double-click SM$COUNTRY$BLANKCOUNT to move it from Available fields to
Selected fields.
d) Double-click SM$COUNTRY$NULLCOUNT to move it from Available fields to
Selected fields.
e) Double-click SM$COUNTRY$PATCOUNT to move it from Available fields to
Selected fields.
f) Double-click SM$COUNTRY$PATMODE to move it from Available fields to
Selected fields.
g) Click OK to close the Log Error To Repository Event Properties window.
2) Click OK to close the Events window.
d. Click OK to close the New Task window.
7. Select File  Close to close the Business Rules Manager tab.
8. Create a data job to call the task.
a. If necessary, click the Home tab.
b. Verify that the Folders riser bar is selected.
c. Expand Basics Demos  batch_jobs.
d. Click New  Data Job.
1) Type Ch5D6_StandardizeCountry_CheckPattern in the Name field.
2) Click OK.
e. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the resource pane.
2) Click the expand icon in front of the Data Inputs grouping of nodes.
3) Double-click the Data Source node.
f. Specify properties for the Data Source node.
1) Type Customers Table in the Name field.
2) Click the browse button next to the Input table field.
a) Click the expand icon in front of the dfConglomerate Gifts data source.
b) Click Customers.
c) Click OK to close the Select Table window.
3) Click the double-arrow button to remove all selected fields.
4) Double-click COUNTRY/REGION to move it from the Available list to the Selected list.
5) Click OK to save the changes and close the Data Source Properties window.
g. Add a Standardization node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
2) Click the expand icon in front of the Quality grouping of nodes.
3) Double-click the Standardization node.
h. Specify properties for the Standardization node.
1) Type Standardize Country in the Name field.
2) Double-click COUNTRY/REGION to move it from the Available list to the Selected list.
3) For the COUNTRY/REGION field, click
Country (ISO 3 Char).
under Definition and select
4) Click Preserve null values.
5) Click OK to save the changes and close the Standardization Properties window.
i. Add a Data Monitoring node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Quality grouping of nodes.
2) Click the expand icon in front of the Monitor grouping of nodes.
3) Double-click the Data Monitoring node.
j. Specify properties for the Data Monitoring node.
1) Type Check Country Pattern in the Name field.
2) Click the down arrow for the Task name field and select Country Pattern Count Check.
3) Type Check Country Pattern in the Description field.
4) Click the down arrow for the Source field and select Customers.
5) Click the down arrow under the Field Name field and select COUNTRY/REGION_Stnd (for the task field of COUNTRY).
6) Click OK to save the changes and close the Data Monitoring Properties window.
9. Select File  Save Data Job.
10. Select Actions  Run Data Job.
11. When the job completes, click the Log tab.
A message was written to the log for each violation that was encountered.
Note: The number of aggregate rows read can be only 1 and the number of triggered events can be only 0 or 1 because it is a set rule.
12. Select Tools  Monitoring Report  Basics Demos.
13. From the Details pane, click the Trigger Values tab.
14. Select File  Close to close the tab for the Monitor Viewer.
15. Select File  Close to close the tab for the data job.
Exercises
4. Creating a Monitoring Job for a Group-Based Rule
• Create a new field BRAND (Brand Name) in the Basics Exercises repository.
• Create a group-based rule named Low Brand Count that groups on distinct values of BRAND.
The rule should check for the counts of the values of BRAND being less than 5 (five).
• Create a task (named Low Brand Count) that calls the Low Brand Count rule with a
Log Error to Repository event. Log the fields BRAND and SM$BRAND$COUNT.
• Export the new task as a data job named Ch5E4_LowBrandCount by selecting
Basics Exercises  batch_jobs.
• Run the new job.
QUESTION:
How many aggregate records were monitored?
Answer:
QUESTION:
How many records triggered the monitoring event?
Answer:
• View the records (if there are any) that triggered the monitoring event.
Monitoring Dashboards
The Monitor dashboard displays alert trigger information for a selected repository on a separate tab. Each tab contains the following elements:
• Triggers Per Date (Last X Days)
• Triggers Per Records Processed
• Triggers Per Source (Last X Days)
• Triggers Per User (Last X Days)
Note: The Triggers per Date graph shows a percentage of triggers per records processed. The graph scale is from 0 to 100, representing percent. For example, if you have 100 triggers in 5000 rows, the triggers represent 2.0% of the records processed. If you move your cursor over a data point, you get a tooltip with helpful information such as "... 11 triggers in 3276 rows."
Note: The number of days for the Last X Days is configured as part of the Monitoring options by selecting Tools  Data Management Studio Options from the menu.
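The percentage that the dashboard plots is straightforward to reproduce. A minimal sketch of the calculation from the note above (the function name is illustrative):

```python
def trigger_percent(triggers, rows):
    """Triggers as a percentage of records processed, on a 0-100 scale."""
    return 100.0 * triggers / rows

# The example from the note: 100 triggers in 5000 rows is 2.0%.
# The tooltip example, 11 triggers in 3276 rows, is about 0.34%.
```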
Viewing the Monitor Dashboard
In this demonstration, you examine the accumulated information in the Basics Demos repository via the
Monitor dashboard.
1. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Information riser bar.
4. Click Monitor from the Information pane.
5. Click the tab labeled Basics Demos - Dashboard.
5.4 Solutions to Exercises
1. Creating a Row-Based Rule
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Select Tools  Business Rules Manager  Basics Exercises.
A new tab appears with the Business Rules Manager for the Basics Exercises repository.
c. Create two needed fields.
1) Right-click the Fields folder and select New Field.
a) Type PK in the Name field.
b) Type Primary Key in the Description field.
c) Click OK.
2) Right-click the Fields folder and select New Field.
a) Type UPC in the Name field.
b) Type Universal Product Code in the Description field.
c) Click OK.
d. Create a new row-based rule.
1) Right-click the Rules folder and select New Rule.
2) Type UPC Check in the Name field.
3) Verify that Row is selected in the Type area.
4) Under Definition, click Builder on the Checks tab.
a) Click Get information about the contents of a field (is null, is numeric, etc.) under Get
Field Information in the Step 1 area.
b) Click the arrow button to move the check to the Step 2 area.
c) Click the Field is null link.
(1) Click the browse button next to the Field field. The Select Field window appears.
(i) Click the UPC field.
(ii) Click OK to close the Select Field window.
(2) Click the down arrow in the Check field and select is not numeric.
(3) Click OK to close the Get Field Information window.
d) Click OR.
e) Click Compare the length of a field to a value under Get Field Information in the Step 1
area.
f) Click the arrow button to move the check to the Step 2 area.
g) Click the Length of Field is equal to Value link.
(1) Click the browse button next to the Field field. The Select Field window appears.
(i) Click the UPC field.
(ii) Click OK to close the Select Field window.
(2) Click the down arrow in the Operator field and select not equal to.
(3) Type 14 in the Length field.
(4) Click OK to close the Compare Length To Value window.
h) Click OK to close the Expression Builder window.
The Definition area of the Checks tab should resemble the following:
5) Click the Fields tab.
6) Double-click the PK field to move it from the Available list to the Selected list.
7) Click OK to close the New Rule window.
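The rule built above flags a row when UPC is not numeric or its length is not equal to 14. A minimal Python sketch of that check; the function name is hypothetical, and `isdigit` is a simplified stand-in for the rule's numeric test:

```python
def upc_violates(upc):
    """UPC Check row rule: violation when UPC is not numeric
    OR its length is not 14. (isdigit approximates "is numeric".)"""
    text = str(upc)
    return (not text.isdigit()) or (len(text) != 14)

# A 14-digit value passes; a short or non-numeric value violates.
```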
e. Select File  Close to close the Business Rules Manager tab.
2. Adding a Business Rule and an Alert to a Profile
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Expand the Basics Exercises repository.
e. Expand the profiles_and_explorations folder.
f. Double-click the Ch3E3_dfConglomerateGrocery_Profile profile.
g. Click the Properties tab.
h. Expand dfConglomerate Grocery and click BREAKFAST_ITEMS.
i. Click the Business Rules sub-tab.
j. Click Add.
1) Click the down arrow in the Business rule field and select UPC Check.
2) Click the down arrow in the Field Name field for PK and select ID.
3) Click Log data that violates the business rule.
4) Click OK.
k. Click the Alerts sub-tab.
l. Click Add.
1) Click Business Rule.
2) Click Continue.
3) Click the down arrow in the Business rule field and select UPC Check.
4) Click the down arrow in the Comparison field under Alert condition and select Percentage of failed rows is greater than.
5) Type 0 (zero) in the Value field.
6) Type UPC field - Percentage of failed rows is greater than 0 in the Description field.
7) Click OK.
m. Select File  Save Profile to save the changed profile.
n. Select Actions  Run Profile Report.
1) Type Third profile in the Description field.
2) Verify that Append to existing report is selected.
3) Click OK to close the Run Profile window.
The profile executes. A status of the execution is displayed.
The Report tab becomes active.
o. Review the Profile report.
1) Click the expand icon in front of dfConglomerate Grocery.
2) Click the BREAKFAST_ITEMS table.
3) Click the Business Rules sub-tab.
QUESTION: Were there violations to the business rule?
Answer:
Yes
QUESTION: If so, what records did not pass the rule check?
Answer:
Double-click the UPC Check item on the Business Rules sub-tab.
4) Click the Alerts sub-tab.
The icon indicating that the alert was met does not appear on the UPC field. The alert is on a business rule, not on a specific field as is the case when you set a standard metric alert.
QUESTION: Was the alert triggered?
Answer:
Yes
p. Select File  Close Profile.
3. Creating a Monitoring Job for a Row-Based Rule
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Select Tools  Business Rules Manager  Basics Exercises.
d. Right-click the Tasks folder and select New Task.
1) Type UPC Check in the Name field.
2) Double-click the UPC Check rule to move it from the Available list to the Selected list.
3) With the rule highlighted in the Selected list, click Events.
a) Click Add in the Events window.
(1) Click the down arrow in the Select event field and select Log error to repository.
(2) Click Continue.
(3) Click the double-arrow button to move the two fields (PK, UPC) from Available fields to Selected fields.
(4) Click OK to close the Log Error To Repository Event Properties window.
b) Click OK to close the Events window.
4) Click OK to close the New Task window.
e. Right-click the Sources folder and select New Source.
1) Type Breakfast Items in the Name field.
2) Type Breakfast Items Table in the Description field.
3) Click OK to close the Source Properties window.
f. If necessary, expand the Tasks folder.
g. Right-click the UPC Check task and select Export to Data Job.
1) Verify that Data Source is selected.
2) Click Next.
a) Click the browse button next to the Input table field.
(1) Expand the dfConglomerate Grocery data source.
(2) Click the BREAKFAST_ITEMS table.
(3) Click OK to close the Select Table window.
b) Click the double-arrow button to remove all the selected columns from the list.
c) Double-click ID to move it from the Available list to the Selected list.
d) Double-click UPC to move it from the Available list to the Selected list.
3) Click Next.
4) Click the down arrow in the Data Source Field field for the PK task field and select ID.
5) Click Next.
6) Type Check UPC Field in the Description field.
7) Click the browse button next to the Job name field.
a) Type Ch5E3_UPCCheck in the Name field.
b) If necessary, select Basics Exercises as the value for the Save in field.
c) Double-click the batch_jobs folder.
d) Click Save.
8) Click the down arrow in the Source field and select Breakfast Items.
9) Click Finish.
h. Review the data job.
1) Click the Home tab.
2) Verify that the Folders riser bar is selected.
3) Expand Basics Exercises  batch_jobs.
4) Double-click the Ch5E3_UPCCheck data job.
5) Click the icon to display the data flow.
i. Select Actions  Run Data Job.
j. When the job completes, click the Log tab.
QUESTION: How many records were monitored?
Answer:
2617
QUESTION: How many records triggered the monitoring event?
Answer:
85
k. From the data job, select Tools  Monitoring Report  Basics Exercises.
l. Click the Trigger Values tab in the Details pane.
m. Click Close to close the Record Detail window.
n. Select File  Close to close the tab for the Monitor Viewer.
o. Select File  Close to close the tab for the data job.
p. Select File  Close to close the Business Rules Manager tab.
4. Creating a Monitoring Job for a Group-Based Rule
a. If necessary, select Start  All Programs  DataFlux  Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Select Tools  Business Rules Manager  Basics Exercises.
d. Create a new field by right-clicking the Fields folder and selecting New Field.
1) Type BRAND in the Name field.
2) Type Brand Name in the Description field.
3) Click OK.
e. Create a new group-based rule.
1) Right-click the Rules folder and select New Rule.
2) Type Low Brand Count in the Name field.
3) Click Group in the Type area.
4) Click Group By.
a) Double-click BRAND to move it to the Selected list in the Group By window.
b) Click OK to close the Group By window.
5) Under Definition, click Builder on the Checks tab.
6) Click Compare a field to a value or to another field under Compare Entire Field in the
Step 1 area.
7) Click the arrow button to move the check to the Step 2 area.
8) Click the Field is equal to Value link.
The Get Field Information window appears.
a) Click the browse button next to the Field field. The Select Field window appears.
b) Click the expand icon to expand the BRAND field.
c) Click the expand icon to expand the Standard folder.
d) Click COUNT.
e) Click OK to close the Select Field window.
9) Click the down arrow in the Check field and select less than.
10) Type 5 (five) in the Value field.
11) Click OK to close the Compare To Value Or Field window.
12) Click OK to close the Expression Builder window.
13) Click OK to close the New Rule window.
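The group rule built above counts rows per distinct BRAND value and flags any brand whose count is less than 5. A sketch of the equivalent logic, assuming hypothetical sample data:

```python
from collections import Counter

def low_brand_counts(brands, threshold=5):
    """Group rows by BRAND and return the brands (with their counts)
    whose row count is below the threshold."""
    counts = Counter(brands)
    return {brand: n for brand, n in counts.items() if n < threshold}

brands = ["Acme"] * 6 + ["Zenith"] * 2          # hypothetical BRAND column
flagged = low_brand_counts(brands)
```

One aggregate record is evaluated per distinct BRAND value, which is why the question below asks about aggregate records rather than source rows.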
f. Right-click the Tasks folder and select New Task.
1) Type Low Brand Count in the Name field.
2) Double-click the Low Brand Count rule to move it from the Available list to the Selected list.
3) With the rule highlighted in the Selected list, click Rule Details.
a) Click Add in the Events window.
(1) Click the down arrow in the Select event field and select Log error to repository.
(2) Click Continue.
(3) Double-click BRAND to move it from Available fields to Selected fields.
(4) Double-click SM$BRAND$COUNT to move it from Available fields to Selected
fields.
(5) Click OK to close the Log Error To Repository Event Properties window.
b) Click OK to close the Events window.
4) Click OK to close the New Task window.
g. Right-click the Low Brand Count task and select Export to Data Job.
1) Verify that Data Source is selected.
2) Click Next.
3) Click the browse button next to the Input table field.
a) Expand the dfConglomerate Grocery data source.
b) Click the BREAKFAST_ITEMS table.
c) Click OK to close the Select Table window.
4) Click the double-arrow button to remove all selected fields.
5) Double-click BRAND to move it from the Available list to the Selected list.
6) Click Next.
7) Verify that the BRAND field from the data source is mapped to the task field of the same
name.
8) Click Next.
9) Type Low Brand Count Group Rule in the Description field.
10) Click the browse button next to the Job name field.
a) Type Ch5E4_LowBrandCount in the Name field.
b) If necessary, select Basics Exercises as the value for the Save in field.
c) Double-click the batch_jobs folder.
d) Click Save.
11) Click the down arrow in the Source field and select Breakfast Items.
12) Click Finish.
h. Click the Home tab.
i. Verify that the Folders riser bar is selected.
j. Expand Basics Exercises  batch_jobs.
k. Double-click the Ch5E4_LowBrandCount data job.
l. Click the icon to display the data flow.
m. Select Actions  Run Data Job.
n. When the job completes, click the Log tab.
A message was written to the log for each violation that was encountered.
QUESTION: How many aggregate records were monitored?
Answer:
255
QUESTION: How many records triggered the monitoring event?
Answer:
138
o. From the data job, select Tools  Monitoring Report  Basics Exercises.
p. Click the Trigger Values tab in the Details pane.
q. Click Close to close the Record Detail window.
r. Select File  Close to close the tab for the Monitor Viewer.
s. Select File  Close to close the tab for the data job.
t. Select File  Close to close the Business Rules Manager tab.