
How Web Scraping Brings Freedom to Research

2022-08-04

There are several stages to any academic research project, most of which differ depending on the hypothesis and methodology. Few disciplines, however, can completely avoid the data collection step. Even in qualitative research, some data has to be collected.

Unfortunately, the one unavoidable step is also the most complicated one. Good, high-quality research necessitates a ton of carefully selected (and often randomized) data. Getting all of it takes an enormous amount of time. In fact, it's likely the most time-consuming step out of the entire research project, regardless of discipline.

Four primary methods are employed when data has to be collected for research. Each of these comes with numerous drawbacks; some, however, are especially troublesome:

Related: Website Scraping Is an Easy Growth Hack You Should Try

Manual data collection

One of the most tried-and-true methods is manual collection. It's almost a foolproof method, as the researcher has complete control over the process. Unfortunately, it's also the slowest and most labor-intensive approach of them all.

Additionally, manual data collection runs into randomization issues (where randomization is required), as it can be nigh impossible to introduce genuine randomness into the sample without expending even more effort than initially planned.
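For illustration, here's a minimal Python sketch of drawing a simple random sample from an already-collected set; the `records` list is hypothetical stand-in data, not anything from a real study:

```python
import random

# Hypothetical stand-in for a manually collected set of records.
records = [{"id": i, "source": "survey"} for i in range(10_000)]

random.seed(42)                          # make the draw reproducible
sample = random.sample(records, k=500)   # simple random sample, no replacement
print(len(sample))                       # -> 500
```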

Finally, manually collected data still requires cleaning and maintenance. There's too much room for error, especially when extremely large swaths of information need to be collected. In many cases, the collection is not even performed by a single person, so everything needs to be normalized to a single standard.
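As an illustration of that normalization step, here's a hedged sketch (using pandas and hypothetical records) of reconciling two collectors' differently labeled and formatted entries:

```python
import pandas as pd

# Hypothetical data: two collectors recorded the same fields
# under different column names and with inconsistent formatting.
collector_a = pd.DataFrame({"Name": [" Ada Lovelace "], "dob": ["1815-12-10"]})
collector_b = pd.DataFrame({"name": ["Alan Turing"], "DOB": ["1912-06-23"]})

# Normalize column names so both frames share one schema.
collector_b = collector_b.rename(columns={"name": "Name", "DOB": "dob"})
merged = pd.concat([collector_a, collector_b], ignore_index=True)

# Equalize the values themselves: trim whitespace, parse dates to one type.
merged["Name"] = merged["Name"].str.strip()
merged["dob"] = pd.to_datetime(merged["dob"])
print(merged)
```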

Existing public or research databases

Some universities purchase large datasets for research purposes and make them available to the student body and staff. Additionally, due to existing data laws in some countries, governments publish censuses and other information yearly for public consumption.

While these are generally great, there are a few drawbacks. For one, university database purchases are driven by research intent and grant funding. A single researcher is unlikely to convince the finance department to buy them the data they need from a vendor, as there might not be sufficient ROI to justify it.

Additionally, if everyone is acquiring their data from a single source, that can cause uniqueness and novelty issues. There's a theoretical limit to the insights that can be extracted from a single database, unless it's continually renewed and new sources are added. Even then, many researchers working with a single source might unintentionally skew results.

Finally, having no control over the collection process might also skew the results, especially if the data is acquired through third-party vendors. Data might be collected without research purposes in mind, so it could be biased or reflect only a small piece of the puzzle.

Related: Using Alternative Data for Short-Term Forecasts

Getting data from companies

Businesses have begun working more closely with universities. Many companies, including Oxylabs, have developed partnerships with numerous universities. Some businesses offer grants; others provide tools or even entire datasets.

All of these types of partnerships are great. However, I firmly believe that providing only the tools and solutions for data acquisition is the correct decision, with grants being a close second. Datasets are unlikely to be that useful for universities for several reasons.

First, unless the company extracts data for that particular research alone, there may be issues with applicability. Businesses will collect the data that's necessary for their operations and not much else. Such data may happen to be useful to other parties, but that won't always be the case.

Additionally, just as with existing databases, these collections might be biased or have other fairness issues. Such issues might not be as apparent in business decision-making, but could be critical in academic research.

Finally, not all businesses will give away data with no strings attached. While some precautions may be warranted, especially if the data is sensitive, some organizations will also want to see the results of the study.

Even without any ill intentions from the organization, outcome reporting bias could become an issue. Non-results or bad results could be seen as disappointing and even damaging to the partnership, which would unintentionally skew research.

Moving on to grants, there are some known issues with them as well. However, they are not as pressing. As long as studies are not completely funded by a company in a field in which it is involved, publishing biases are less likely to occur.

In the end, providing the infrastructure that allows researchers to gather data without any overhead, beyond the necessary precautions, is the approach least susceptible to bias and other publication issues.

Related: Once Only for Huge Companies, 'Web Scraping' Is Now an Online Arms Race No Internet Marketer Can Avoid

Enter web scraping

Continuing my previous thought, one of the best solutions a business can provide researchers with is web scraping. After all, it's a process that enables automated data collection (in either raw or parsed formats) from many disparate sources.
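To make that concrete, here is a minimal sketch of what automated collection can look like in Python, assuming the `requests` and `beautifulsoup4` packages and a placeholder URL; the same page is kept both raw and parsed into structure:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com"  # placeholder for a real research target

response = requests.get(URL, timeout=10)
response.raise_for_status()

raw_html = response.text                        # "raw" collection
soup = BeautifulSoup(raw_html, "html.parser")   # "parsed" collection
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]
print(headings)
```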

Creating web scraping solutions, however, takes an enormous amount of time, even if the necessary knowledge is already in place. So, while the benefits for research might be great, there's rarely a good reason for someone in academia to get involved in such an undertaking.

Such an undertaking is time-consuming and difficult even if we discount all the other pieces of the puzzle: proxy acquisition, CAPTCHA solving and many other roadblocks. As such, companies can provide access to ready-made solutions that allow researchers to skip past these difficulties.
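For a sense of just one of those roadblocks, here's a hedged sketch of routing a request through a proxy; the endpoint and credentials are placeholders, not any real provider's API:

```python
import requests

# Hypothetical proxy endpoint; providers typically manage whole
# rotating pools behind a single address like this.
PROXY = "http://user:pass@proxy.example.com:8080"

response = requests.get(
    "https://example.com",
    proxies={"http": PROXY, "https": PROXY},
    timeout=10,
)
print(response.status_code)
```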

Building web scrapers, however, would not be worth the effort if these solutions didn't play an important part in the freedom of research. In all the other cases I've outlined above (outside of manual collection), there's always the risk of bias and publication issues. Additionally, researchers are always limited by one factor or another, such as the volume or selection of data.

With web scraping, however, none of these issues occur. Researchers are free to acquire any data they need and tailor it to the study they are conducting. The organizations providing the web scraping also have no skin in the game, so there's no reason for bias to appear.

Finally, as so many sources are available, the doors are wide open to conduct interesting and unique research that otherwise would be impossible. It's almost like having an infinitely large dataset that can be updated with nearly any information at any time.

In the end, web scraping is what will allow academia and researchers to enter a new age of data acquisition. It will not only ease the most expensive and complicated step of research, but also let them break free of the usual issues that come with acquiring data from third parties.

For those in academia who want to enter the future earlier than others, Oxylabs is willing to join hands in helping researchers through the pro bono provision of our web scraping solutions.
