Methods for extracting data from the Internet

The advent of the Internet has yielded exciting new opportunities for the collection of large amounts of structured and unstructured social scientific data. This thesis describes two such methods for harvesting data from websites and web services: web-scraping and connecting to an application progra...

Bibliographic Details
Main Author: Willers, Joel (Author)
Format: Electronic Book
Language: English
Published: 2017
Year: 2017
Online Access: Full text (free of charge)

MARC

LEADER 00000nam a22000002 4500
001 1866148826
003 DE-627
005 20231018043713.0
007 cr uuu---uuuuu
008 231018s2017 xx |||||o 00| ||eng c
035 |a (DE-627)1866148826 
035 |a (DE-599)KXP1866148826 
040 |a DE-627  |b ger  |c DE-627  |e rda 
041 |a eng 
084 |a 2,1  |2 ssgn 
100 1 |a Willers, Joel  |e VerfasserIn  |4 aut 
245 1 0 |a Methods for extracting data from the Internet 
264 1 |c 2017 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
520 |a The advent of the Internet has yielded exciting new opportunities for the collection of large amounts of structured and unstructured social scientific data. This thesis describes two such methods for harvesting data from websites and web services: web-scraping and connecting to an application programming interface (API). I describe the development and implementation of tools for each of these methods. In my review of the two related, yet distinct data collection methods, I provide concrete examples of each. To illustrate the first method, ‘scraping’ data from publicly available data repositories (specifically the Google Books Ngram Corpus), I developed a tool and made it available to the public on a web site. The Google Books Ngram Corpus contains groups of words used in millions of books that were digitized and catalogued. The corpus has been made available for public use, but in its current form, accessing the data is tedious, time-consuming, and error-prone. For the second method, utilizing an API from a web service (specifically the Twitter Streaming API), I used a code library and the R programming language to develop a program that connects to the Twitter API to collect public posts known as tweets. I review prior studies that have used these data, after which I report results from a case study involving references to countries. The relative prestige of nations is compared based on the frequency of mentions in English literature and mentions in tweets. 
856 4 0 |u https://core.ac.uk/download/141671065.pdf  |x Verlag  |z kostenfrei  |3 Volltext 
912 |a NOMM 
935 |a mkri 
951 |a BO 
ELC |a 1 
LOK |0 000 xxxxxcx a22 zn 4500 
LOK |0 001 4391829584 
LOK |0 003 DE-627 
LOK |0 004 1866148826 
LOK |0 005 20231018043713 
LOK |0 008 231018||||||||||||||||ger||||||| 
LOK |0 035   |a (DE-2619)CORE46034756 
LOK |0 040   |a DE-2619  |c DE-627  |d DE-2619 
LOK |0 092   |o n 
LOK |0 852   |a DE-2619 
LOK |0 852 1  |9 00 
LOK |0 935   |a core 
OAS |a 1 
ORI |a SA-MARC-krimdoka001.raw