116 docs tagged with "data-set"

Assigning Data Source Admins

When creating a Data Source, or anytime later (as a Data Source Admin), a user can assign additional Admins to the Data Source through the screen below. The logged-in user is automatically moved into the right panel and treated as a Data Source Admin.

Catalog Object Entitlements

As with Data Sources and Data Sets, users of a Catalog must be given defined roles (or entitlements) when the Catalog is created. These can also be changed when editing a Catalog. The following entitlements are available to be assigned:

Classifying a Data Set

A logged-in user with Read/Write or Admin Entitlement can Classify the Data Set through:

Cloning & Deleting a Concept Parser Project

A Classification Project can be Cloned to provide the user with a means to tweak or change the inputs on the project and re-run it keeping the original project intact. This can be thought of as an A/B experiment option provided to users to experiment with their project.

Cloning a Resolve Project

A "Resolve Project" can be cloned to allow a user to tweak or change the Project's inputs and re-run it while keeping the original Project intact. This is an A/B experiment option provided to users for their Projects.

Creating a Classification Project

A user can view the current projects in the Tenant by going to the Data Classification Projects listing screen from the ‘Project’ option in the left navigation panel of the Classify module.

Creating a New Catalog

A user can create a new Catalog from the My Data Catalog section that appears on the Classify home screen, as shown below. They can also create a new Catalog from the Catalog List screen accessed from the left nav menu by clicking the “Create New Data Catalog” button.

Creating New Data Set

New Data Set(s) can be created by selecting the required Data Set content files OR by a Create All Job from the Data Source screen. Let us look at the first way below:

Creating New Data Source Connection

Creating a new Data Source connection begins with choosing the Data Source Type and some other details as shown in the screen below. You can have multiple Data Sources feeding into the Classify product. This also requires entering the:

Data Quality Exceptions in Golden Records

We’ve talked about the Data Quality of Golden Records in the earlier section. Apart from metrics and summary information on quality, the system also provides details of exceptions and rules that caused conflict pertaining to EACH Golden Record.

Data Quality of Golden Records

Now let’s talk about the Data Quality measure. For Data Management teams, it is important to gauge and improve the quality of data, especially for Golden Records, which will be treated as a refined source of truth for the teams. To enable this, we’ve provided users with a holistic look at the Data Quality of the generated Golden Record Dataset.

Data Quality Rules for Entities

As explained in the earlier sections, the Resolve project uses Business Entities to generate a Golden Record, ideally containing the most complete and up-to-date set of information after complex matching and merging operations. To get the most out of this process, it is important to keep the quality of the data used in those Entities in check.

Data Set Object Roles & Entitlements

When creating a Data Set, the logged-in user needs to grant Entitlements to that Data Set to themselves and to other users associated with the Tenant. These Entitlements are:

Data Set Relationships

Data Set Relationships can be accessed through the namesake tab (i.e., ‘Data Set Relationship’) after opening a Data Set. Access to this tab requires a minimum of Data Read Entitlement for the Data Set.

Data Set Sample

By clicking on the Data Set Sample tab, the user is taken to a screen where a sample of all the columns of the Data Set is shown. Note that Data Set columns tagged as PII (Personally Identifiable Information) will be masked. Columns are not tagged as PII directly but through their Concepts in the Catalog, which we’ll talk about in another section.
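The indirect tagging described above (a column is masked because the Concept it maps to is tagged PII) can be illustrated with a minimal Python sketch. The mappings, the `mask_sample` function, and the `****` mask string are all hypothetical names chosen for illustration; they are not Fluree Sense's actual data model or API.

```python
from typing import Any, Dict, List

# Hypothetical Catalog data: which Concepts are tagged PII,
# and which Concept each Data Set column is mapped to.
CONCEPT_IS_PII: Dict[str, bool] = {"Email": True, "Country": False}
COLUMN_TO_CONCEPT: Dict[str, str] = {"email_addr": "Email", "country": "Country"}


def mask_sample(
    rows: List[Dict[str, Any]],
    column_to_concept: Dict[str, str],
    concept_is_pii: Dict[str, bool],
    mask: str = "****",
) -> List[Dict[str, Any]]:
    """Return a copy of the sample rows with PII columns masked.

    A column is masked only if the Concept it maps to is tagged PII;
    unmapped columns are left untouched.
    """
    masked = []
    for row in rows:
        masked.append({
            col: mask if concept_is_pii.get(column_to_concept.get(col), False) else val
            for col, val in row.items()
        })
    return masked
```

Because the PII flag lives on the Concept, changing the tag on one Concept in this sketch would change the masking of every column mapped to it.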

Dataset Attributes

After the user creates and registers a Data Set, they can click on it to be redirected to the main Data Set page. This page gives key information about the Data Set.

Dataset Attributes Feedback

In the Dataset Attributes tab, which opens as the default tab for a Data Set, the user can perform 2 main actions:

Deleting a Data Set

You can Delete a Data Set that you have no use for if you have Dataset Admin rights to that Data Set. This is a Soft-Delete: the file is not physically deleted, because Fluree Sense simply captures the metadata from the physical data. The physical data will continue to reside in the appropriate Data Source.
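As a rough illustration of the Soft-Delete idea, the sketch below marks a metadata record as deleted and hides it from listings without touching the underlying file. The `DataSetRecord` class and its fields are hypothetical and only stand in for whatever metadata the product actually stores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSetRecord:
    """Hypothetical metadata entry describing a Data Set."""
    name: str
    source_path: str       # location of the physical data in the Data Source
    deleted: bool = False  # soft-delete flag; the physical data is never touched


def soft_delete(record: DataSetRecord) -> None:
    """Flag the metadata record as deleted; the file at source_path stays in place."""
    record.deleted = True


def visible(records: List[DataSetRecord]) -> List[DataSetRecord]:
    """Return only the records that should still appear in a listing."""
    return [r for r in records if not r.deleted]
```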

Deleting a Project

A user may wish to Delete a Resolve Project as part of a normal Cleanup. This is a soft delete, but currently, there is no way to retrieve the Project from the UI. Deletion of the Project removes it from Display in the project list.

Editing a Concept Parser Project

A Classification Project can be edited by any user who has Project Admin rights for that Project. To edit a Classification Project, please follow the steps below. Remember that you do NOT need to make changes in every step. A step is typically saved when you press the ‘Next’ button, unless it has its own button such as ‘Apply Changes’.

Editing a Data Set

Once a Data Set is added, it appears in the Data Set list screen. Depending on the processes that have run on it, you can view the Data Set columns, Sample, etc. If the Data Set registration job is complete, you will also be able to see the latest Concepts to which that Data Set’s columns are mapped.

Editing a Data Source

You can edit a Data Source that you have created if you have a Data Source Admin role for that Data Source. Please follow the steps below to edit a Data Source. These are essentially the same steps as in the Create Data Source workflow. You may either move to the Next step without making any edits on a specific screen, or make edits wherever you feel they are necessary.

Editing Data Set Entitlements

In an earlier section, we looked at how Data Set Entitlements are set when creating a Data Set. However, it is quite possible that you may wish to edit those existing rights. This can be done from the ‘Data Entitlements’ tab in the Data Set detail view.

Exporting a Data Set

You can export a Data Set if you have access to it. Currently, the Export function just exports the Data Set summary.

Fixing Tasks

Fixing Tasks, as the name suggests, are the Tasks to "fix" any final or remaining "Data Issues," where the Machine Learning model can't be of much help. This usually happens when a machine learning model has reached or passed a threshold limit of confidence, after which tuning or training would lead to diminishing returns.

Getting Started

Log in to your account by accessing the URL provided to you and entering the provisioned User ID and password as shown below.

Giving Feedback to Ad-hoc Mappings

Now that we have seen what ad-hoc mappings look like in the earlier section, let's check out how we can give feedback on these mappings. The feedback process is almost the same at both the Semantic Object and Concept level. The only difference is that feedback at the Semantic Object level is given on Data Set mappings, whereas feedback at the Concept level is given on Data Set column mappings.

Importing Catalog Structure

The user is also able to import complete Catalogs from a file. This may be a more practical way to create large Catalogs.

Importing Concept Mappings

In the earlier sections, we saw how a user can provide feedback and mappings through various means, including the most recent case where the user can provide training through a workflow.

Importing Rules in Bulk

Fluree Sense also provides an interface to create rules quickly and easily in bulk through import. You can import both Technical and Business rules in Bulk.

Introduction to Concept Parser Projects

The Classify Product from Fluree Sense comes with another powerful and unique feature: Machine Learning or AI-led Data Parsing capability. In the Semantic Object Project, which we’ve seen earlier, we were defining a Classifier.

Introduction to Entities

An Entity in the Resolve module is the same as what we refer to as Semantic Objects in Classify. An Entity can be a uniquely identifiable person, institution or thing and is the business object which may be referenced by multiple data tables (or Data Sets as we call them). For example, let's say we have ‘Customer’ as an Entity, and we have a Data Set for ‘Customer Profile’ and another one for ‘Customer Address Information’. In this case, we may arrive at the conclusion that both data sets refer to the same Entity.

Introduction to Tasks

In this section, we’ll be talking about a specific type of Task we colloquially call ‘Catalog Task.’ In the system, this corresponds to two different types of Tasks:

Job Types

Both Classify and Resolve provide for Viewing of Jobs. A Job is simply a process triggered in a non-blocking (asynchronous) fashion: the user can keep working and moving from one screen to another while the Job completes its work in the background. A Job may take anywhere from a couple of minutes to several hours. The performance of a Job depends on its complexity, the available memory and computing power (essentially the cloud specs), and the amount of data.
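The non-blocking behaviour described here can be sketched in a few lines of Python using the standard library's thread pool. This is only an illustration of the general pattern (submit work, keep going, collect the result later); the function names and the job itself are hypothetical and say nothing about how Fluree Sense schedules Jobs on its cluster.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def run_job(name: str, seconds: float) -> str:
    """Stand-in for a long-running Job such as profiling a Data Set."""
    time.sleep(seconds)
    return f"{name}: done"


def submit_job(executor: ThreadPoolExecutor, name: str, seconds: float) -> Future:
    """Start the Job in the background and return immediately with a handle."""
    return executor.submit(run_job, name, seconds)
```

A caller would hold on to the returned `Future`, continue with other work, and check `future.done()` or call `future.result()` when the outcome is needed, much as a user moves between screens while a Job's progress bar advances.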

Managing Catalogs

Once a Catalog is created, it can be edited as required by any user with a Catalog Admin role. Catalog Management provides for the following functionality:

Managing Project Tasks by Admin

In the earlier sections, we've seen how a Project Reviewer, Approver, and Project Admin can provide feedback for Tasks in the Project's "Train Model" screens. Resolve Projects also have a dedicated Manage Project Tasks screen that is accessible only to the Project Admin.

Managing Synonyms

When we talk about Managing Synonyms, we’re essentially discussing the ability to provide feedback to Synonyms and Re-run the model. Existing Synonyms can be viewed and accessed to provide feedback by clicking on the Synonym count next to the Semantic Object or Concept for which the Synonyms have been created.

Other Data Quality Rule Views

The Fluree Sense Data Quality feature provides a 360-degree view of the Data Quality of your data. You can view Data Quality not only at the Data Set level but also at the Catalog (Data Dictionary), Semantic Object, or Concept level. Some of these views also depend on your licensing. For example, the Catalog, Semantic Object, and Concept level views will only be visible if you have the Classify Product licensed.

Overview of Classify

- Fluree Sense is a full end-to-end platform designed to Ingest, Classify, Resolve, and Consume Big Data.

Publishing Golden Records

Once the Golden Records have been generated and you feel they have the requisite level of confidence and quality, you can go ahead and publish them. Golden Records can be published any time after the first run of the Project. There is no system threshold, confidence level, etc. for publishing; we’ve left it to the users to decide when they want to publish their Golden Records Dataset.

Publishing Semantic Data Set

Once the user has run through Catalog Classification, they can Publish the ‘Semantic Data Set’ to get the benefit of their exercise. Let us understand this concept through an example.

Reassigning Catalog Tasks

Imagine a situation where a Task is assigned to a specific user, but that user is on leave or unable to work on those tasks. You’d probably re-assign it to a team member if a co-worker from the same department was there, right?

Refreshing & Re-profiling Data

A Data Set undergoes Registration and Profiling the first time it is registered. This is explained in detail in the Editing a Data Set section. However, in the practical world, data never stays constant. Often, a Data Source changes and is updated periodically. Provided certain conditions are met, Fluree Sense can refresh your data and fetch the delta (changed) records ad hoc or on a pre-set schedule.

Registering / Profiling a Data Set

As discussed in the section on Creating Data Sets, once a new Data Set is created, the process for profiling and registering is triggered as well. This process happens asynchronously and in steps. In the initial step, the Data Set sample and attributes are loaded and displayed. In the next step, the Data Set is profiled. Next, as the Classification task is run on the Data Set, Data Set Relationships are re-generated and DQ rules are re-run. While this happens, progress is indicated by the progress bar/loader in various sections of the Data Set.

Related Projects

The ‘Related Projects’ tab shows the Projects in which this Data Set is in use in Classify. These may be projects for Semantic Object Classification as well as projects of the type Concept Parser.

Rule Views at Dataset Level

A user can also analyze the Data Quality at the Dataset Level starting from the whole Dataset down to specific columns and then for each rule on that column.

Running & Re-running a Project

This aspect is common to all projects in general. Projects can be re-run after completing the Tasks generated by them. There may be some validations and restrictions as to the minimum number of Tasks/All Tasks needing to be completed.

Running the Model

Another aspect that the user needs to be aware of is that whenever a Run model is activated, whether by Classifying a Dataset or by training a model at the Object or Concept Level, it triggers classification for the whole tenant. This is because the Concept is linked to other Concepts, Data Quality Rules, Data Sets and any change in that Concept cannot be independent. So, the changes occur across the Tenant in a holistic manner as determined by the machine learning model.

Semantic Objects Concepts

It will be useful to discuss Semantic Objects and Concepts a little here. As we create the Catalog above, which is like a Data Dictionary of the business, it is pertinent to note the following:

Tagging of Data

There are two types of Tagging we need to know about as a user:

Technical View of Semantic Objects

The Classify System provides users with the flexibility to examine their Business Objects in a Technical View as well. As the name suggests, this view focuses more on Data to Column relationships.

Training a Concept Parser Project

Once the Project has completed its first ‘run’, the initial results will be available for viewing. Details of these are available in the section on Project Home Screen and Project Result. The important thing to note is that most projects won’t achieve a sufficient level of confidence in just the first run.

Training Catalog Generated Tasks

Catalog Task Training is somewhat like Project Task Training. However, there are some key differences and intricacies. So, let’s look at them.

Training Matching Tasks

You can access Tasks from the Project Home screen by clicking the Train Model icon in the Entities Resolved section of the Project Home Screen. Please check the Section on Viewing Project Home Screen to understand how the Project Home screen looks and works.

Training Merging Tasks

The Golden Record creation (i.e., “Merging”) model synthesizes the records within a cluster into a single record containing the best data from all records in the cluster. So, if there are three possible addresses from records from three different sources in a cluster, the “Merging” model will attempt to select the most likely accurate address out of the three.
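To make the idea of synthesizing one record from a cluster concrete, here is a minimal Python sketch that picks, for each field, the most frequent non-empty value across the cluster. This majority-vote rule is a deliberately simple stand-in chosen for illustration; the actual "Merging" model is a trained machine learning model and is not described by this code. The function name and record layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def merge_cluster(records: List[Dict[str, Optional[str]]]) -> Dict[str, Optional[str]]:
    """Build one merged record from a cluster of matched records.

    For each field, keep the most frequent non-empty value.
    Ties are resolved in favour of the value seen first.
    """
    # Collect field names in first-seen order so the output is stable.
    fields: List[str] = []
    for record in records:
        for field in record:
            if field not in fields:
                fields.append(field)

    golden: Dict[str, Optional[str]] = {}
    for field in fields:
        values = [r[field] for r in records if r.get(field)]
        golden[field] = Counter(values).most_common(1)[0][0] if values else None
    return golden
```

With three source records whose addresses are "12 Oak St", "12 Oak St", and "12 Oak Street", this sketch would keep "12 Oak St" for the merged record.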

Training Tasks in Bulk through Import

As we have seen in earlier sections, importing tasks or feedback is the best method for bulk updates. For Catalog Tasks as well, we provide the ‘Bulk Import’ feature. To use this feature:

Types of Data Sources

Fluree Sense allows different types of Data Sources and can take Data in the form of CSV as well as Files and from RDBMS Tables. Currently, Fluree Sense can support the following Data Sources:

Viewing Catalogs

All the active Catalogs appear in the Catalog List screen with their names and some other useful information as shown below. Users can access this screen from the ‘Catalog’ option in the left nav of Classify.

Viewing Data Sets

When the user clicks on the Data Set tab on the left menu, they will be directed to the main Data Set page. This page will include all the datasets that the user has access to, as well as some information about these datasets including:

Viewing Data Sources

Fluree Sense allows users to access Data from various cloud-based and on-prem environments such as Databricks, Cloud Storage, Hadoop, Snowflake, or traditional RDBMS such as Microsoft SQL etc. In this section, we will explore the screen where you have a holistic view of all the data sources.

Viewing Entities Mastered

To view “Entities Mastered”, click the “View Results” icon (marked 1) in the lower right panel:

Viewing Entities Resolved

Now, let's look at the results of the "Entity Resolution" model. You can access the results by clicking the eyeglass or "View Results" icon in the "Entities Resolved" panel.

Viewing Golden Records Lineage & Relationships

Another important view of Golden Records that users may want to see is the Golden Records Lineage view. As the name suggests, this view shows which specific records across the source systems have been combined to create the Golden Records after the resolving and mastering operations.

Viewing Project Home Screen

Let's circle back to the Project Creation Flow. After the user has mapped the details and "Run" the Project, it is sent as a job to the Cluster. It may take up to a minute for the Job to move to the processing queue and the progress display to appear on the screen. Once the Job starts, the user can see the progress through various stages of the Resolve project through progress bars with text information across the result areas.

Viewing Project Home Screen

Once we’ve Run the Project, it is sent as a Job to the Cluster. It may take up to a minute for the Job to be moved to the processing queue and the progress display to appear on the screen. Once the Job starts, the user can see the progress through various stages of the Classify project through progress-bars, with text information in the result areas.

Viewing Project Home Screen

Once we’ve run the Project, it is sent as a job to the Cluster. It may take up to a minute for the Job to be moved to the processing queue and the progress display to appear on the screen. Once the Job starts, the user can see the progress through various stages of the ‘Concept Parser’ project through progress bars with text information in the result areas.

Viewing Project Results

Click on the icon marked [4] in the Project Home Screen image to come to the Project Results screen. This screen shows all the results of the project, which are really the model’s predicted values for the Classifier.

Viewing Resolve Project Confidence

There are two important measures related to Golden Records. One is the Model Confidence, and the other is Data Quality. Let us look at Confidence first. The Confidence is shown separately for Entities Resolved and Entities Mastered. The Model Confidence in Resolve, and typically across the product, is split into High, Medium, and Low confidence records, which together give the combined confidence figure.

Viewing Semantic Object Model

The user can also view the Semantic Object Model from the Technical View tab of the Semantic Object. This is a powerful feature which allows you to understand the emerging Data Model by providing a view of how the physical data or tables map to the Business Object, which in this case is ‘Customer.’