Enterprise data is critical to business success. Companies around the world understand this and leverage platforms such as Snowflake to make the most of information streaming in from various sources.


Copyright: venturebeat.com – “What is dirty data? Sources, impact, key strategies”


However, more often than not, this data can become “dirty.” In essence, at any stage of the pipeline it can lose key attributes such as accuracy, accessibility and completeness (among others), making it unsuitable for the downstream use the organization originally intended.

“Some data can be objectively wrong. Data fields can be left blank, misspelled or inaccurate names, addresses, phone numbers can be provided and duplicate information…are some examples. However, whether that data can be classed as dirty very much depends on context.

For example, a missing or incorrect email address is not required to complete a retail store sale, but a marketing team who wishes to contact customers via email to send promotional information will classify that same data as dirty,” Jason Medd, research director at Gartner, told VentureBeat.
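Medd's point that "dirty" depends on context can be illustrated with a minimal Python sketch. The use-case names, field names and the `REQUIRED_FIELDS` mapping are hypothetical, chosen only to mirror the retail-sale versus email-marketing example; real data quality rules would be defined by each team.

```python
# Hypothetical mapping: which fields each use case needs (an assumption for illustration).
REQUIRED_FIELDS = {
    "retail_sale": {"customer_id", "amount"},
    "email_marketing": {"customer_id", "email"},
}


def is_missing(value):
    """Treat None and blank strings as missing; other values count as present."""
    return value is None or (isinstance(value, str) and not value.strip())


def is_dirty(record, use_case):
    """Return True if the record lacks any field the given use case requires."""
    required = REQUIRED_FIELDS[use_case]
    return any(is_missing(record.get(field)) for field in required)


# The same record, judged against two different use cases.
record = {"customer_id": "C-1", "amount": 19.99, "email": ""}
sale_dirty = is_dirty(record, "retail_sale")
marketing_dirty = is_dirty(record, "email_marketing")
```

Here the record is acceptable for completing the sale but would be flagged as dirty by the marketing team, because the email field is blank.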

In addition, the untimely and inconsistent flow of information can also add to the problem of dirty data within an organization. The latter particularly occurs in the case of merging information from two or more systems using different standards. For instance, if one system classifies names as a single field while the other divides them into two, only one will be considered valid, with the other requiring cleansing.
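The name-field mismatch described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`name`, `first_name`, `last_name`) are hypothetical, and splitting on the first space is a naive rule that will mishandle many real names (multi-part given names, single names, different name orders).

```python
def split_full_name(full_name):
    """Split a single-field name into (first, last) on the first run of whitespace."""
    parts = full_name.strip().split(None, 1)
    first = parts[0] if parts else ""
    last = parts[1] if len(parts) > 1 else ""
    return first, last


def normalize_record(record):
    """Return a record in the two-field layout, whichever layout it arrived in."""
    if "name" in record:
        # Record from the hypothetical system that stores the name as one field.
        first, last = split_full_name(record["name"])
    else:
        # Record from the hypothetical system that already uses two fields.
        first = record.get("first_name", "").strip()
        last = record.get("last_name", "").strip()
    return {"first_name": first, "last_name": last}


# Records from both systems end up in the same shape after cleansing.
from_system_a = normalize_record({"name": "  Ada Lovelace "})
from_system_b = normalize_record({"first_name": "Ada", "last_name": "Lovelace"})
```

Choosing one layout as the target and converting everything else to it is one possible cleansing approach; the article itself does not prescribe a specific method.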

Sources of dirty data

Overall, the entire issue boils down to five key sources:


As Medd explained, dirty data can occur due to human error at the point of entry. This could result from careless work by the person entering the data, a lack of training, or poorly defined roles and responsibilities. Many organizations do not even consider establishing a data-focused collaborative culture.


Process oversights can also lead to dirty data. For instance, poorly defined data lifecycles could lead to the use of outdated information across systems (people change phone numbers and addresses over time). There could also be issues due to the lack of data quality firewalls at critical data capture points or the lack of clear cross-functional data processes.


Technology glitches such as programming errors or poorly maintained internal/external interfaces can affect data quality and consistency. Many organizations can even miss out on deploying data quality tools or end up keeping multiple varying copies of the same data due to system fragmentation.[…]

Read more: www.venturebeat.com

