Methods for extracting data from the Internet

Willers, Joel

The advent of the Internet has yielded exciting new opportunities for the collection of large amounts of structured and unstructured social scientific data. This thesis describes two such methods for harvesting data from websites and web services: web-scraping and connecting to an application progra...

Bibliographic Details
Main Author:	Willers, Joel (Author)
Format:	Electronic Book
Language:	English
Published:	2017
In:	Year: 2017
Online Access:	Full text (free of charge)

MARC


LEADER	00000nam a22000002 4500
001	1866148826
003	DE-627
005	20231018043713.0
007	cr uuu---uuuuu
008	231018s2017 xx \|\|\|\|\|o 00\| \|\|eng c
035			\|a (DE-627)1866148826
035			\|a (DE-599)KXP1866148826
040			\|a DE-627 \|b ger \|c DE-627 \|e rda
041			\|a eng
084			\|a 2,1 \|2 ssgn
100	1		\|a Willers, Joel \|e VerfasserIn \|4 aut
245	1	0	\|a Methods for extracting data from the Internet
264		1	\|c 2017
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
520			\|a The advent of the Internet has yielded exciting new opportunities for the collection of large amounts of structured and unstructured social scientific data. This thesis describes two such methods for harvesting data from websites and web services: web-scraping and connecting to an application programming interface (API). I describe the development and implementation of tools for each of these methods. In my review of the two related yet distinct data collection methods, I provide concrete examples of each. To illustrate the first method, ‘scraping’ data from publicly available data repositories (specifically the Google Books Ngram Corpus), I developed a tool and made it available to the public on a web site. The Google Books Ngram Corpus contains groups of words used in millions of books that were digitized and catalogued. The corpus has been made available for public use, but in its current form, accessing the data is tedious, time-consuming, and error-prone. For the second method, utilizing an API from a web service (specifically the Twitter Streaming API), I used a code library and the R programming language to develop a program that connects to the Twitter API to collect public posts known as tweets. I review prior studies that have used these data, after which I report results from a case study involving references to countries. The relative prestige of nations is compared based on the frequency of mentions in English literature and mentions in tweets.
856	4	0	\|u https://core.ac.uk/download/141671065.pdf \|x Verlag \|z kostenfrei \|3 Volltext
912			\|a NOMM
935			\|a mkri
951			\|a BO
ELC			\|a 1
LOK			\|0 000 xxxxxcx a22 zn 4500
LOK			\|0 001 4391829584
LOK			\|0 003 DE-627
LOK			\|0 004 1866148826
LOK			\|0 005 20231018043713
LOK			\|0 008 231018\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|ger\|\|\|\|\|\|\|
LOK			\|0 035 \|a (DE-2619)CORE46034756
LOK			\|0 040 \|a DE-2619 \|c DE-627 \|d DE-2619
LOK			\|0 092 \|o n
LOK			\|0 852 \|a DE-2619
LOK			\|0 852 1 \|9 00
LOK			\|0 935 \|a core
OAS			\|a 1
ORI			\|a SA-MARC-krimdoka001.raw