Original poster: ermulong

How should we view "data cleaning"?

Posted on 2004-11-29 09:15:54

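I agree with Teacher Bao (包老师). Especially when the data is clinical data, it must not be modified or deleted. If it is true that [B]"a data warehouse is built by extracting and transforming data from the original databases or text files into new fact tables in the warehouse, and data cleaning happens during that extraction and transformation rather than by operating directly on the original database"[/B], then whether "data cleaning" is even an accurate name is worth discussing.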
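To make the quoted description concrete, here is a minimal sketch of cleaning during extraction and transformation while the source records stay untouched. The field names (`id`, `name`, `age`), the normalization rules, and the function `build_fact_rows` are illustrative assumptions, not taken from any real system discussed in this thread.

```python
def build_fact_rows(source_rows):
    """Build cleaned fact rows from source rows without mutating the source.

    Rows that cannot be cleaned are not deleted from the source; they are
    only left out of the fact table and recorded in a reject log.
    """
    facts, rejects = [], []
    for row in source_rows:
        cleaned = dict(row)  # work on a copy, never on the source row
        # Hypothetical rule: collapse whitespace and normalize capitalization.
        cleaned["name"] = " ".join(cleaned["name"].split()).title()
        try:
            # Hypothetical rule: age must parse as an integer.
            cleaned["age"] = int(cleaned["age"])
        except (TypeError, ValueError):
            rejects.append((row["id"], "age is not an integer"))
            continue
        facts.append(cleaned)
    return facts, rejects


source = [
    {"id": 1, "name": "  zhang   wei ", "age": "34"},
    {"id": 2, "name": "LI NA", "age": "unknown"},
]
facts, rejects = build_fact_rows(source)
```

The reject log keeps a record of what was excluded and why, so anyone who disagrees with a rule can rerun the transformation with a different one against the unchanged source.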
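Generally speaking, data clean-up targets bad or dirty data. Until it is clear why the bad data arose, how to identify it, and what the consequences of deleting it would be, rashly deleting or modifying data is risky. The following briefly describes how dirty data arises: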

Dirty Data

Most healthcare facilities have recognized that their Master Patient Index (MPI) has data integrity problems. The most common form of dirty data is duplicate Medical Record Numbers assigned to one person. The size of the problem varies among facilities, but industry estimates range from 3% to more than 15%. Other forms of dirty data include overlaps (one person has more than one enterprise identifier across an enterprise master person index), overlays (one person is assigned, in the master patient index, to another person's record), and erroneous, invalid, or default data stored in key identifying fields.


Causes

There are a number of reasons why duplicate records occur. During patient registration, human errors can occur, including misspellings, missing data, use of nicknames, and typos or transpositions, to name a few. Identity fraud can result in duplicate records. As facilities are bought and sold, disparate information systems and data sets are merged. Converting data sets can result in missing, default, and corrupt data. Finally, the more MPI data integrity is compromised, the easier it is for further duplicates to be created. When a database is cluttered with duplicates, the duplicate growth rate becomes exponential.
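As an illustration of identifying duplicates before deciding what to do with them, the sketch below groups MPI-style records by a normalized name and date of birth and reports every group that carries more than one Medical Record Number. The record layout (`mrn`, `name`, `dob`) and the matching rule are assumptions made for the example. It only flags candidates and deletes nothing. It catches differences in case, spacing, and punctuation, but not nicknames, misspellings, or transpositions, which would need fuzzy matching.

```python
from collections import defaultdict


def match_key(record):
    """Normalize a record to a comparison key: letters of the name plus DOB."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalpha())
    return (name, record["dob"])


def duplicate_candidates(records):
    """Return {key: sorted MRNs} for keys shared by more than one MRN."""
    groups = defaultdict(set)
    for rec in records:
        groups[match_key(rec)].add(rec["mrn"])
    return {key: sorted(mrns) for key, mrns in groups.items() if len(mrns) > 1}


# Made-up sample records for illustration only.
records = [
    {"mrn": "A100", "name": "Mary O'Brien", "dob": "1970-02-14"},
    {"mrn": "B205", "name": "mary obrien ", "dob": "1970-02-14"},
    {"mrn": "C310", "name": "John Smith", "dob": "1985-07-01"},
]
candidates = duplicate_candidates(records)
```

Here the first two records collapse to the same key and are reported together for review, while the third is left alone.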
Posted on 2004-11-30 18:43:44

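Whether or not the data looks reasonable, it is an archive, and an archive must not be altered, because any modification depends on who makes it. How can you guarantee that you are the one who is right? If I take over tomorrow, say that all your changes were wrong, and change everything again, what does the data end up as?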
Posted on 2010-3-30 10:32:23
17# tyq

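First, you need to be clear about the purpose of data cleaning. Data cleaning is not done to make storage easier or to save disk space. It is done to obtain the information patterns you are interested in and to lay the groundwork for the next step, data mining.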

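Different people looking at the same data will not necessarily focus on the same things. Data is static, while people are adaptable. An archive has no value in itself; only when it is put to use does it turn into value.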