Generalizability of peer-to-peer file sharing paper results
Created: 18-03-2010 18:00 Last Updated: 29-03-2010 04:37
Since the publication of our paper on the inadvertent disclosure of health information on peer-to-peer file sharing networks (available on the JAMIA web site: http://jamia.bmj.com/content/17/2/148.short?q=w_jamia_current_tab; henceforth the "p2p paper"), a number of colleagues representing the academic community, government, and regulators have asked us questions about it. The questions ask for more detail about how the study was conducted and how to interpret the results. We will try to address some of these questions in this knowledgebase.
The first set of questions was about the generalizability of the results over geography and time. There are many facets to this question, so we present a number of considerations below:
- We cannot make claims about the size of the inadvertent disclosure problem beyond the US and Canada, as these were the only regions we examined. Although inadvertent disclosure of personal health information (PHI) and personal financial information (PFI) is likely occurring in Europe and Asia as well, the extent of the problem there may differ from what we reported in the p2p paper.
- Our data collection was performed in 2008. Some of the p2p client vendors have since changed their client software, ostensibly to improve privacy protections (among other feature additions). These privacy-protective changes were intended to reduce the opportunity for end-users to inadvertently disclose personal information. Our data collection would not reflect the potential benefits of such changes. It takes time for a new version of a p2p client to be disseminated and adopted, and to replace older versions. Over time, however, improvements to protect privacy would be expected to reduce the number of machines disclosing PHI and PFI, although at this point we do not have any data to support that expectation.
- It is also reasonable to expect that more health information will be digitized over time. Therefore, we expect that the number of PHI files being inadvertently disclosed would also increase over time. Since we did our data collection in 2008, there would potentially be more PHI files on p2p networks now. Whether this increase balances out the protection afforded by more privacy-friendly p2p client tools is an open question that would be interesting to track over time.
- Recent studies have noted that p2p traffic as a proportion of total Internet traffic has been decreasing over time (see http://www.wired.com/epicenter/2009/10/p2p-dying/, which summarizes a study from Arbor Networks, and this summary from Cisco: http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/Cisco_VNI_Usage_WP.html). This phenomenon would reduce the relative pool of IP addresses that can potentially inadvertently disclose personal information. To the extent that this is the case, such a decrease can affect both the numerator and the denominator of our prevalence estimates. Therefore, coupled with the other two effects noted above, it is not clear what the long-term trend in the prevalence of inadvertent disclosure of PHI and PFI through p2p clients would be.
- As you will notice in our p2p paper, the amount of data we actually collected was approximately double the estimated sample size. The reason is that we collected the data at two different points in time in 2008. The prevalence estimates were almost identical in both cases, and therefore we decided to pool the data, which also reduces the width of the confidence intervals. The estimates were therefore quite stable over the 4-6 month period during which we collected data.
- Because many ISPs in Canada throttle p2p traffic, we ran our data collection applications on multiple ISPs to maximize our ability to finish the study within a reasonable amount of time. As a result, we used different end points within Canada to collect the data, which, to the extent that it matters, should positively impact generalizability.
- The way searches work in the p2p networks we examined means that our searches propagate from one superpeer to another. In theory, a search that runs long enough will cover a large proportion of the network. In our study the queries ran for weeks. Therefore, we have reason to believe that we had wide coverage of the network.
- The denominator that we used to compute the percentages did not include the IP addresses that had viruses only. Admittedly, including them would have resulted in a small adjustment to the percentages. However, viruses are arguably not proper documents, and therefore a case can be made for excluding them, as we did.
- There is some research on how to create random samples from p2p networks. However, these methods require control of the nodes in the network so that the probability of forwarding a request to a particular other node can be set. In practice it is not possible for us to control these probabilities in the network. This work is interesting and is powerful in simulated settings, but thus far it would not be applicable when collecting data from a real p2p network.
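To illustrate the pooling point above, the following is a minimal sketch of how combining two data collection waves with similar prevalence narrows the confidence interval. The counts are hypothetical and are not the study's data, and the Wilson score interval is an assumption chosen for illustration; the p2p paper may have used a different interval method.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Return (estimate, lower, upper) for a proportion.

    Uses the Wilson score interval, which behaves reasonably for the
    small proportions typical of prevalence estimates. z=1.96 gives an
    approximate 95% interval.
    """
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half, centre + half


def pooled_interval(samples, z=1.96):
    """Pool (hits, n) pairs from several waves into one estimate."""
    hits = sum(h for h, _ in samples)
    n = sum(size for _, size in samples)
    return wilson_interval(hits, n, z)


# Hypothetical counts for illustration only (NOT the study's data):
# (IP addresses disclosing PHI, IP addresses examined) per wave.
wave_1 = (12, 1500)
wave_2 = (13, 1500)

for label, (hits, n) in (("wave 1", wave_1), ("wave 2", wave_2)):
    p, lo, hi = wilson_interval(hits, n)
    print(f"{label}: {p:.4f} (width {hi - lo:.4f})")

p, lo, hi = pooled_interval([wave_1, wave_2])
print(f"pooled: {p:.4f} (width {hi - lo:.4f})")
```

Pooling in this way is only reasonable when the per-wave estimates are close, as they were reported to be here; if the waves had differed markedly, they would need to be reported separately.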
If you have any additional questions about generalizability, please post a comment or send us an email.
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.