Assigning Data Source Admins
When creating a Data Source, or at any time afterwards (as a Data Source Admin), a user can assign additional Admins to the Data Source through the screen below. The logged-in user is automatically moved into the right panel and treated as a Data Source Admin.
Catalog Object Entitlements
Like Data Sources and Data Sets, a Catalog requires its users to be given defined roles (or entitlements) when it is created. These can also be changed when editing the Catalog. The following entitlements are available to be assigned:
Classifying a Data Set
A logged-in user with Read/Write or Admin Entitlement can Classify the Data Set through:
Cloning & Deleting a Concept Parser Project
A Classification Project can be Cloned to provide the user with a means to tweak or change the inputs on the project and re-run it keeping the original project intact. This can be thought of as an A/B experiment option provided to users to experiment with their project.
Cloning a Resolve Project
A "Resolve Project" can be cloned to allow a user to tweak or change the Project's inputs and re-run it while keeping the original Project intact. This is an A/B experiment option provided to users for their Projects.
Creating a Classification Project
A user can view the current projects in the Tenant by going to the Data Classification Projects listing screen from the ‘Project’ option in the left navigation panel of the Classify module.
Creating a Concept Parser Project - Initial Setup
The prerequisites for creating a Concept Parser are the following:
Creating a Concept Parser Project – Training & Project Data
Now that the 4-step initial set-up is done, let’s examine the next steps.
Creating a New Catalog
A user can create a new Catalog from the My Data Catalog section that appears on the Classify home screen, as shown below. They can also create a new Catalog from the Catalog List screen accessed from the left nav menu by clicking the “Create New Data Catalog” button.
Creating an Entity
To create a new Entity, please follow the steps listed below:
Creating New Data Set
New Data Set(s) can be created by selecting the required Data Set content files OR by a Create All Job from the Data Source screen. Let us look at the first way below:
Creating New Data Source Connection
Creating a new Data Source connection begins with choosing the Data Source Type and some other details as shown in the screen below. You can have multiple Data Sources feeding into the Classify product. This also requires entering the:
Data Quality Exceptions in Golden Records
We’ve talked about the Data Quality of Golden Records in the earlier section. Apart from metrics and summary information on quality, the system also provides details of exceptions and rules that caused conflict pertaining to EACH Golden Record.
Data Quality of Golden Records
Now let’s talk about the Data Quality measure. For Data Management teams, it is important to gauge and improve the quality of data, especially for Golden Records, which will be treated as a refined source of truth for the teams. To enable this, we’ve provided users with a holistic look at the Data Quality of the generated Golden Record Dataset.
Data Quality Rules for Entities
As explained in the earlier sections, the Resolve project involves using Business Entities to generate a Golden Record, ideally containing the most complete and up-to-date set of information after complex matching and merging operations. To get the most out of this process, it is important to keep the quality of the data used in those Entities in check.
Data Set Object Roles & Entitlements
When creating a Data Set, the logged-in user needs to grant Entitlements to that Data Set to themselves and to other users associated with the Tenant. These Entitlements are:
Data Set Relationships
Data Set Relationships can be accessed through the namesake tab (i.e., ‘Data Set Relationship’) after opening a Data Set. Access to this tab requires a minimum of Data Read Entitlement for the Data Set.
Data Set Sample
By clicking on the Data Set Sample tab, the user is taken to a screen where a sample of all the columns of the Data Set is shown. Note that Data Set Columns tagged as PII (Personally Identifiable Information) will be masked. Columns are not tagged as PII directly but through their Concepts in the Catalog, which we’ll talk about in another section.
Dataset Attributes
After the user creates and registers a Data Set, they can click on it to be redirected to the main Data Set page. This page gives key information about the Data Set.
Dataset Attributes Feedback
In the Dataset Attributes tab, which opens as the default tab for a Data Set, the user can perform 2 main actions:
Deleting a Data Set
You can delete a Data Set that you have no use for if you have Dataset Admin rights to that Data Set. This is a soft delete: the file is not physically deleted, because Fluree Sense simply captures the metadata from the physical data. The physical data will continue to reside in the appropriate Data Source.
Deleting a Project
A user may wish to Delete a Resolve Project as part of a normal Cleanup. This is a soft delete, but currently, there is no way to retrieve the Project from the UI. Deletion of the Project removes it from Display in the project list.
Editing a Concept Parser Project
A Classification Project can be edited by any user who has Project Admin rights for that Project. To edit a Classification Project, please follow the steps below. Remember that you do NOT need to make changes in every step. Changes in a step are typically saved when you press the ‘Next’ button, unless the step provides its own button such as ‘Apply Changes’.
Editing a Data Set
Once a Data Set is added, it appears in the Data Set list screen. Depending on the processes that have run on it, you can view the Data Set columns, Sample, etc. If the Data Set registration job is complete, you will also be able to see the latest Concepts to which that Data Set’s columns are mapped.
Editing a Data Source
You can edit a Data Source that you have created if you have the Data Source Admin role for that Data Source. Please follow the steps below to edit a Data Source. These are essentially the same steps as in the Create Data Source workflow. You may either move to the Next step without making any edits on a specific screen, or make edits wherever you feel they are necessary.
Editing Data Set Entitlements
In an earlier section, we looked at how Data Set Entitlements are set when creating a Data Set. However, it is quite possible that you may wish to edit those existing rights. This can be done from the ‘Data Entitlements’ tab in the Data Set detail view.
Editing Golden Records Manually
Golden Records get edited in two ways.
Export & Publish Concept Parser Results
Once the Project has reached a requisite level of confidence and the predictions are ready to be used, the User may want to export or publish them.
Exporting a Data Set
You can export a Data Set if you have access to it. Currently, the Export function just exports the Data Set summary.
Fixing Tasks
Fixing Tasks, as the name suggests, are the Tasks to "fix" any final or remaining "Data Issues," where the Machine Learning model can't be of much help. This usually happens when a machine learning model has reached or passed a threshold limit of confidence, after which tuning or training would lead to diminishing returns.
Four Eyes Check & Entitlements
There are three types of roles in the system for any Project:
Four Eyes Check & Entitlements
Let us look at the Four Eyes and Entitlement aspect of Step 2 of Project Creation in some more detail here.
Getting Started
Log in to your account by accessing the URL provided to you and entering the provisioned User ID and password as shown below.
Giving Feedback to Ad-hoc Mappings
Now that we have seen what ad-hoc mappings look like in the earlier section, let's check out how we can give feedback on these mappings. The feedback process is almost the same at both the Semantic Object and Concept levels. The only difference is that feedback at the Semantic Object level is given on Data Set mappings, whereas feedback at the Concept level is given on Data Set column mappings.
Global Search
Searching for a ‘Search’ Term
Importing Catalog Structure
The user is also able to import complete Catalogs from a file. This may be a more practical way to create large Catalogs.
Importing Concept Mappings
In the earlier sections, we saw how a user can provide feedback and mappings through various means, including the most recent case where the user can provide training through a workflow.
Importing Rules in Bulk
Fluree Sense also provides an interface to create rules quickly and easily in bulk through import. You can import both Technical and Business rules in Bulk.
Introduction To Classification Model Training
In earlier sections for Data Set and Catalog, we saw a few ways of Classification. These are listed as follows:
Introduction to Classification Projects
There are two types of Classification Projects:
Introduction to Concept Parser Projects
The Classify Product from Fluree Sense comes with another powerful and unique feature: Machine Learning or AI-led Data Parsing capability. In the Semantic Object Project, which we’ve seen earlier, we were defining a Classifier.
Introduction to Entities
An Entity in the Resolve module is the same as what we refer to as Semantic Objects in Classify. An Entity can be a uniquely identifiable person, institution or thing and is the business object which may be referenced by multiple data tables (or Data Sets as we call them). For example, let's say we have ‘Customer’ as an Entity, and we have a Data Set for ‘Customer Profile’ and another one for ‘Customer Address Information’. In this case, we may arrive at the conclusion that both data sets refer to the same Entity.
Introduction to Tasks
In this section, we’ll be talking about a specific type of Task we colloquially call ‘Catalog Task.’ In the system, this corresponds to two different types of Tasks:
Job Types
Both Classify and Resolve provide for viewing of Jobs. A Job is simply a process triggered in a non-blocking (asynchronous) fashion: the user can keep working and moving from one screen to another while the Job completes its work in the background. A Job may take anywhere from a couple of minutes to several hours. Its performance depends on the complexity of the work, the available memory and computing power (essentially the cloud specs), and the amount of data.
Key Terms and Concepts
- Tenant:
Managing Catalogs
Once a Catalog is created, it can be edited as required by any user with a Catalog Admin role. Catalog Management provides for the following functionality:
Managing Project Tasks by Admin
In the earlier sections, we've seen how a Project Reviewer, Approver, and Project Admin can provide feedback for Tasks in the Project's "Train Model" screens. Resolve Projects also have a dedicated Manage Project Tasks screen that is accessible only to the Project Admin.
Managing Synonyms
When we talk about Managing Synonyms, we’re essentially discussing the ability to provide feedback to Synonyms and Re-run the model. Existing Synonyms can be viewed and accessed to provide feedback by clicking on the Synonym count next to the Semantic Object or Concept for which the Synonyms have been created.
Other Data Quality Rule Views
The Fluree Sense Data Quality feature provides a 360-degree view of the Data Quality of your data. You can view the data not only at the Data Set level but also at the Catalog (Data Dictionary), Semantic Object, or Concept level. Some of these views depend on your licensing. For example, the Catalog, Semantic Object, and Concept level views are only visible if you have the Classify Product licensed.
Overview of Classify
- Fluree Sense is a full end-to-end platform designed to Ingest, Classify, Resolve, and Consume Big Data.
Profile Management & Header Controls
Any user of Fluree Sense can manage their profile through the following Steps after logging in.
Publishing Golden Records
Once the Golden Records have been generated and you feel they have the requisite level of confidence and quality, you can go ahead and publish them. Golden Records can be published any time after the first run of the Project. There is no system threshold, confidence level, etc. for publishing; we’ve left it to the users to decide when they want to publish their Golden Records Dataset.
Publishing Semantic Data Set
Once the user has run through Catalog Classification, they can Publish the ‘Semantic Data Set’ to get the benefit of their exercise. Let us understand this concept through an example.
Reassigning Catalog Tasks
Imagine a situation where a Task is assigned to a specific user, but that user is on leave or unable to work on those tasks. You’d probably re-assign it to a team member if a co-worker from the same department was there, right?
Refreshing & Re-profiling Data
A Data Set undergoes Registration and Profiling the first time it is registered. This is explained in detail in the Editing a Data Set section. However, in practice, data never stays constant. Often, a Data Source will change and be updated periodically. Provided certain conditions are met, Fluree Sense lets you refresh your data and get the delta (changed) records ad hoc or on a pre-set schedule.
Registering / Profiling a Data Set
As discussed in the section on Creating Data Sets, once a new Data Set is created, the process for profiling and registering it is triggered as well. This process happens asynchronously and in steps. In the initial step, the Data Set sample and attributes are loaded and displayed. In the next step, the Data Set is profiled. Then, as the Classification task is run on the Data Set, Data Set Relationships are re-generated and DQ rules are re-run. While this happens, progress is indicated through the progress bar/loader in various sections of the Data Set.
Related Projects
The ‘Related Projects’ tab shows the Projects in which this Data Set is in use in Classify. These may be projects for Semantic Object Classification as well as projects of the type Concept Parser.
Rule Views at Catalog Level
Let us check out the Data Quality Rule Views at the Catalog Level. These include:
Rule Views at Dataset Level
A user can also analyze the Data Quality at the Dataset Level starting from the whole Dataset down to specific columns and then for each rule on that column.
Running & Re-running a Project
This aspect is common to all projects in general. Projects can be re-run after completing the Tasks generated by them. There may be some validations and restrictions as to the minimum number of Tasks/All Tasks needing to be completed.
Running the Model
Another aspect that the user needs to be aware of is that whenever a model run is triggered, whether by Classifying a Dataset or by training a model at the Object or Concept Level, it triggers classification for the whole Tenant. This is because a Concept is linked to other Concepts, Data Quality Rules, and Data Sets, so a change in that Concept cannot be independent. The changes therefore occur across the Tenant in a holistic manner, as determined by the machine learning model.
Semantic Objects Concepts
It will be useful to discuss a little about Semantic Objects and Concepts here. As we create the Catalog above, which is like a Data Dictionary of the business - it is pertinent to note the following:
System Configuration
Supported Platforms
Tagging of Data
There are two types of Tagging we need to know about as a user:
Technical View of Semantic Objects
The Classify System provides users with the flexibility to examine their Business Objects in a Technical View as well. As the name suggests, this view focuses more on Data to Column relationships.
Training a Concept Parser Project
Once the Project has completed its first ‘run’, the initial results will be available for viewing. Details of these are available in the section on Project Home Screen and Project Result. The important thing to note is that most projects won’t achieve a sufficient level of confidence in just the first run.
Training at Concept Level
Training the Model at Concept Level - through workflow
Training at Semantic Object Level
Training the Model at Object Level - through workflow:
Training Catalog Generated Tasks
Catalog Task Training is somewhat like Project Task Training. However, there are some key differences and intricacies. So, let’s look at them.
Training Matching Tasks
You can access Tasks from the Project Home screen by clicking the Train Model icon in the Entities Resolved section of the Project Home Screen. Please check the Section on Viewing Project Home Screen to understand how the Project Home screen looks and works.
Training Merging Tasks
The Golden Record creation (i.e., “Merging”) model synthesizes the records within a cluster into a single record containing the best data from all records in the cluster. So, if there are three possible addresses from records from three different sources in a cluster, the “Merging” model will attempt to select the most likely accurate address out of the three.
Training Tasks in Bulk through Import
As we have seen in earlier sections, importing tasks or feedback is the best method for bulk updates. For Catalog Tasks as well, we provide the ‘Bulk Import’ feature. To use this feature:
Types of Data Sources
Fluree Sense allows different types of Data Sources and can take data in the form of CSV and other files as well as from RDBMS tables. Currently, Fluree Sense supports the following Data Sources:
User Management
Types of Users and Roles available
Viewing Catalogs
All the active Catalogs appear in the Catalog List screen with their names and some other useful information as shown below. Users can access this screen from the ‘Catalog’ option in the left nav of Classify.
Viewing Data Sets
When the user clicks on the Data Set tab on the left menu, they will be directed to the main Data Set page. This page will include all the datasets that the user has access to, as well as some information about these datasets including:
Viewing Data Sources
Fluree Sense allows users to access Data from various cloud-based and on-prem environments such as Databricks, Cloud Storage, Hadoop, Snowflake, or traditional RDBMS such as Microsoft SQL etc. In this section, we will explore the screen where you have a holistic view of all the data sources.
Viewing Entities Mastered
To view “Entities Mastered”, click on the “View Results” icon (marked 1) in the lower right panel:
Viewing Entities Resolved
Now, let's look at the results of the "Entity Resolution" model. You can access the results by clicking the eyeglass or "View Results" icon in the "Entities Resolved" panel.
Viewing Golden Records Lineage & Relationships
Another important view of Golden Records that users may want to see is the Golden Records Lineage view. As the name suggests, this view shows which specific records across the source systems have been combined to create the Golden Records after the resolving and mastering operations.
Viewing Project Home Screen
Let's circle back to the Project Creation Flow. After the user has mapped the details and "Run" the Project, it is sent as a job to the Cluster. It may take up to a minute for the Job to move to the processing queue and the progress display to appear on the screen. Once the Job starts, the user can see the progress through various stages of the Resolve project through progress bars with text information across the result areas.
Viewing Project Home Screen
Once we’ve Run the Project, it is sent as a Job to the Cluster. It may take up to a minute for the Job to be moved to the processing queue and the progress display to appear on the screen. Once the Job starts, the user can see the progress through various stages of the Classify project through progress-bars, with text information in the result areas.
Viewing Project Home Screen
Once we’ve run the Project, it is sent as a job to the Cluster. It may take up to a minute for the Job to be moved to the processing queue and the progress display to appear on the screen. Once the Job starts, the user can see the progress through various stages of the ‘Concept Parser’ project through progress bars with text information in the result areas.
Viewing Project Results
Click on the icon marked [4] in the Project Home Screen image to come to the Project Results screen. This screen shows all the results of the project, which are the model’s predicted values for the Classifier.
Viewing Resolve Project Confidence
There are two important measures related to Golden Records. One is the Model Confidence, and the other is Data Quality. Let us look at Confidence first. The Confidence is shown separately for Entities Resolved and Entities Mastered. The Model Confidence in Resolve, and typically across the product, is split into High, Medium, and Low confidence records, which together give the combined confidence figure.
Viewing Semantic Object Model
The user can also view the Semantic Object Model from the Technical View tab of the Semantic Object. This is a powerful feature which allows you to understand the emerging Data Model by providing a view of how the physical data or tables map to the Business Object, which in this case is ‘Customer.’