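I agree with Teacher Bao. Especially when the data is clinical data, it must not be modified or deleted. If, as stated, [B]"a data warehouse is built by extracting and transforming data from the original databases or text files into new fact tables in the warehouse, and data cleaning happens during that extraction and transformation rather than directly in the original database"[/B], then whether "data cleaning" is even the right term is worth discussing.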
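Generally speaking, data clean-up targets bad data or dirty data. Until it is clear why the bad data arises, how to identify it, and what the consequences of deleting it would be, rashly deleting or modifying data is risky. The following is a brief introduction to how dirty data comes about: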
Dirty Data
Most healthcare facilities have recognized that their Master Patient Index (MPI) has data integrity issues. The most common form of dirty data is duplicate Medical Record Numbers being assigned to one person. The size of the problem varies among facilities, but industry estimates put the duplicate rate anywhere from 3% to more than 15%. Other forms of dirty data include overlaps (one person has more than one enterprise identifier across an enterprise master person index), overlays (one person is assigned, in the master patient index, to another person’s record), and erroneous, invalid and / or default data stored in key identifying fields.
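To make the categories above concrete, here is a minimal Python sketch that flags, without deleting anything, records where one person appears under more than one Medical Record Number and records that carry a default value in a key identifying field. The record layout, the sample rows, and the placeholder date `1900-01-01` are all assumptions made for illustration; a real MPI will have its own schema and its own default values.

```python
from collections import defaultdict

# Hypothetical MPI rows: (mrn, last_name, first_name, date_of_birth).
records = [
    ("MRN001", "Smith", "John", "1970-01-15"),
    ("MRN002", "SMITH", "john ", "1970-01-15"),
    ("MRN003", "Lee", "Anna", "1985-06-30"),
    ("MRN004", "Unknown", "Unknown", "1900-01-01"),
]

# Assumed placeholder value that a registration system might store by default.
DEFAULT_DOBS = {"1900-01-01"}


def identity_key(last, first, dob):
    """Build a comparison key that ignores case and surrounding whitespace."""
    return (last.strip().lower(), first.strip().lower(), dob)


def flag_records(rows):
    """Return (groups of MRNs sharing one identity, MRNs with default data)."""
    by_person = defaultdict(list)
    defaults = []
    for mrn, last, first, dob in rows:
        if dob in DEFAULT_DOBS:
            # Default data in a key field: report it, do not try to match it.
            defaults.append(mrn)
            continue
        by_person[identity_key(last, first, dob)].append(mrn)
    duplicates = [mrns for mrns in by_person.values() if len(mrns) > 1]
    return duplicates, defaults


duplicates, defaults = flag_records(records)
print("Possible duplicate MRNs:", duplicates)
print("Default data in key fields:", defaults)
```

The sketch only reports candidates for review. Consistent with the point made earlier in this post, nothing in the source records is changed or removed.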
Causes
There are a number of reasons why duplicate records occur. During the patient registration process, human errors can occur, including misspellings, missing data, use of nicknames, and typos / transpositions, to name a few. Identity fraud can result in duplicate records. As facilities are bought and sold, disparate information systems and data sets are merged. Converting data sets can result in missing, default and corrupt data. Finally, the more MPI data integrity is compromised, the easier it becomes for further duplicates to be created. When a database is cluttered with duplicates, the duplicate growth rate becomes exponential.
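Registration errors such as nicknames, misspellings and transpositions defeat exact matching, so identifying this kind of duplicate usually needs approximate comparison. The sketch below is one simple way to do that with Python's standard `difflib`, not a production matching algorithm. The nickname table, the 0.85 threshold, the field names and the sample rows are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative nickname table; a real one would be much larger.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def normalize(name):
    """Lower-case, trim, and expand a known nickname to its formal form."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)


def full_name(row):
    return normalize(row["first"]) + " " + normalize(row["last"])


def candidate_pairs(rows, threshold=0.85):
    """Return (mrn_a, mrn_b, score) for records that look like one person."""
    pairs = []
    for a, b in combinations(rows, 2):
        # Compare only records with the same date of birth (a simplification).
        if a["dob"] != b["dob"]:
            continue
        score = SequenceMatcher(None, full_name(a), full_name(b)).ratio()
        if score >= threshold:
            pairs.append((a["mrn"], b["mrn"], round(score, 2)))
    return pairs


# Hypothetical registrations.
rows = [
    {"mrn": "A1", "first": "William", "last": "Johnson", "dob": "1960-03-02"},
    {"mrn": "A2", "first": "Bill", "last": "Jonhson", "dob": "1960-03-02"},
    {"mrn": "A3", "first": "Maria", "last": "Garcia", "dob": "1960-03-02"},
    {"mrn": "A4", "first": "William", "last": "Johnson", "dob": "1975-11-20"},
]

pairs = candidate_pairs(rows)
for mrn_a, mrn_b, score in pairs:
    print(mrn_a, mrn_b, score)
```

Here A1 and A2 are paired despite the nickname and the transposed letters, while A4 is left alone because its date of birth differs. Requiring an identical date of birth keeps the example short, but it would miss duplicates whose date of birth was itself mistyped. The output is a list of candidates for human review, not a list of records to delete.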