
How Census Data Put Trans Children at Risk

2022-09-20

Every decade, the U.S. Census Bureau counts the people in the United States, trying to strike a balance between gathering accurate information and protecting the privacy of the people described in that data. But current technology can reveal a person’s transgender identity by linking seemingly anonymized information, such as their neighborhood and age, to discover that their sex was reported differently in successive censuses. The ability to deanonymize gender and other data could spell disaster for trans people and families living in states that seek to criminalize them.

In places like Texas, where families seeking medical care for trans children can be accused of child abuse, the state would need to know which teenagers are trans to carry out its investigations. We worried that census data could be used to make this kind of investigation and punishment easier. Might a weakness in how publicly released data sets are anonymized be exploited to find trans kids, and to punish them and their families? A similar concern underscored the public outcry in 2018 over the prospect of the census asking people to reveal their citizenship: that the data would be used to find and punish people living in the U.S. illegally.

Using our expertise in data science and data ethics, we took simulated data designed to mimic the data sets that the Census Bureau releases publicly and tried to reidentify trans teenagers, or at least narrow down where they might live. Unfortunately, we succeeded. With the data-anonymization approach the Census Bureau used in 2010, we were able to identify 605 trans kids. Thankfully, the Census Bureau is adopting a new differential-privacy approach that will improve privacy overall, but it is still a work in progress. When we reviewed the most recent data released, we found the bureau’s new approach cuts the identification rate by 70 percent: a lot better, but still with room for improvement.

Even as researchers whose work relies on census data to answer questions about life in the U.S., we believe strongly that privacy matters. The bureau is currently holding a public comment period on the design of the 2030 census. Submissions could shape how the census is conducted and how the bureau will go about anonymizing its data. Here is why this is important.

The federal government gathers census data to make decisions about things like the size and shape of congressional districts, or how to disburse funding. Yet government agencies aren’t the only users of the data. Researchers in a variety of fields, such as economics and public health, use the publicly released information to study the state of the nation and make policy recommendations.

But the risks of deanonymizing data are real, and not just for trans children. In a world where private data collection and access to powerful computing systems are increasingly ubiquitous, it might be possible to unwind the privacy protections that the Census Bureau builds into the data. Perhaps most famously, computer scientist Latanya Sweeney showed that almost 90 percent of U.S. citizens could be reidentified from just their ZIP code, date of birth and assigned sex.

In August of 2021, the Census Bureau responded. The organization used the cryptographer-preferred approach of differential privacy to protect its redistricting data. Mathematicians and computer scientists have been drawn to the mathematical elegance of this approach, which involves intentionally introducing a controlled amount of error into key census counts and then cleaning up the results to ensure they remain internally consistent. For example, if the census counted precisely 16,147 people who identified as Native American in a specific county, it might report a number that is close but different, like 16,171. This sounds simple, but counties are made up of census tracts, which are made up of census blocks. That means, in order to get a number that is close to the original count, the census must also tweak the number of Native Americans in each census block and tract; the art of the Census Bureau’s approach is to make all of these close-but-different numbers add up to another close-but-different number.
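A minimal sketch of the general idea appears below, in Python. It adds Laplace noise to hypothetical block counts and then applies a simple post-processing step so the blocks still sum to the published county figure. This illustrates only noise injection plus consistency post-processing; it is not the Census Bureau’s actual TopDown Algorithm, and the counts, the epsilon value and the adjustment rule are assumptions made for the example.

```python
# Illustrative sketch only: Laplace noise injection plus a simple consistency
# fix-up. The counts, epsilon and adjustment rule are hypothetical; this is
# not the Census Bureau's actual disclosure-avoidance system.
import numpy as np

rng = np.random.default_rng(2020)

def noisy_count(true_count: int, epsilon: float) -> float:
    """Add Laplace noise with scale 1/epsilon (sensitivity 1 for a single count)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical true counts for the census blocks that make up one county.
block_counts = np.array([4021, 3987, 5110, 3029])   # sums to 16,147
epsilon = 0.5                                        # smaller epsilon = more noise

noisy_blocks = np.array([noisy_count(c, epsilon) for c in block_counts])
noisy_county = noisy_count(block_counts.sum(), epsilon)

# Post-processing: shift the block estimates so they sum to the published
# county figure, round to non-negative integers, and absorb any rounding
# residual in the largest block so the hierarchy stays internally consistent.
published_county = int(round(noisy_county))
adjusted = noisy_blocks + (published_county - noisy_blocks.sum()) / len(noisy_blocks)
published_blocks = np.maximum(0, np.round(adjusted)).astype(int)
published_blocks[published_blocks.argmax()] += published_county - published_blocks.sum()

print("published county count:", published_county)  # close to, but not exactly, 16,147
print("published block counts:", published_blocks, "sum:", published_blocks.sum())
```

Scaled up to every block, tract and county in the country, and to many tables at once, keeping all of these close-but-different numbers consistent with one another is what makes the bureau’s task hard.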

One might think that protecting people’s privacy is a no-brainer. But some researchers, primarily those whose work depends on the existing data privacy approach, feel differently. These changes, they argue, will make it harder for researchers to do their jobs in practice—while the privacy risks the Census Bureau is protecting against are largely theoretical.

Remember: we’ve shown that the risk is not theoretical. Here’s a bit on how we did it.

We reconstructed a complete list of people under the age of 18 in each census block so that we could learn what their age, sex, race and ethnicity were in 2010. Then we matched this list against the analogous list from 2020 to find people who were now 10 years older and had a different reported sex. This method, called a reconstruction-abetted linkage attack, requires only publicly released data sets. When we had the work reviewed and presented it formally to the Census Bureau, it was robust and worrying enough that researchers from Boston University and Harvard University reached out to us for more details about our work.
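To make the linkage step concrete, here is a minimal, hypothetical sketch in Python using pandas. It assumes the per-block records for 2010 and 2020 have already been reconstructed from published tables (the reconstruction step itself is omitted), and the column names and toy records are illustrative, not the actual data or code used in our analysis.

```python
# Minimal sketch of the linkage step of a reconstruction-abetted linkage
# attack. The reconstructed records below are toy, hypothetical examples.
import pandas as pd

# Hypothetical reconstructed records: one row per person per census block.
recon_2010 = pd.DataFrame({
    "block": ["480011001", "480011001", "480011002"],
    "age":   [7, 12, 9],
    "sex":   ["M", "F", "F"],
    "race":  ["white", "white", "black"],
})
recon_2020 = pd.DataFrame({
    "block": ["480011001", "480011001", "480011002"],
    "age":   [17, 22, 19],
    "sex":   ["F", "F", "F"],
    "race":  ["white", "white", "black"],
})

# A 2010 child should appear in 2020 ten years older, in the same block,
# with the same reported race; link on those quasi-identifiers.
recon_2010 = recon_2010.assign(expected_age_2020=recon_2010["age"] + 10)
linked = recon_2010.merge(
    recon_2020,
    left_on=["block", "expected_age_2020", "race"],
    right_on=["block", "age", "race"],
    suffixes=("_2010", "_2020"),
)

# Candidate records: the same linked person with a different reported sex.
candidates = linked[linked["sex_2010"] != linked["sex_2020"]]
print(candidates[["block", "age_2010", "sex_2010", "sex_2020"]])
```

In practice the hard part is the reconstruction itself, which involves solving for individual records that are simultaneously consistent with many published tables; the linkage shown above is the comparatively simple final step.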

We simulated what a bad actor could do, so how do we make sure that attacks like this don’t happen? The Census Bureau is taking this aspect of privacy seriously, and researchers who use these data must not stand in their way.

The census has been collected at great labor and great cost, and we will all benefit from data produced by this effort. But these data can also do harm, and the Census Bureau’s work to protect privacy has come a long way in mitigating this risk. We must encourage them to continue.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.
