Subject: Web-scraping - OPAC collection catalogue


Title:: Improving the credibility of the extracted position from a vast collection of job offers with machine learning ensemble methods
Authors:: Drozda, Paweł
Ropiak, Krzysztof
Nowak, Bartosz A.
Talun, Arkadiusz
Osowski, Maciej
Links:: https://bibliotekanauki.pl/articles/22615539.pdf
Publication date:: 2023
Publisher:: Uniwersytet Warmińsko-Mazurski w Olsztynie
Subjects:: machine learning
web scraping
granularity method
classification
Description:: The main aim of this paper is to evaluate crawlers that collect job offers from websites. In particular, the research focuses on checking how effectively ensemble machine learning methods validate the position extracted from job ads. Moreover, to shorten the training time of the algorithms (Random Forests and XGBoost), granularity methods were tested as a way to significantly reduce the input training dataset. Both methods achieved satisfactory results in the accuracy and F1 measures, which exceeded 96%. In addition, granulation reduced the input dataset by more than 99%, and the results obtained were only slightly worse (accuracy lower by between 1% and 5%, F1 lower by between 3% and 8%). Thus, it can be concluded that the considered methods can be used in the evaluation of job web crawlers.
Source:: Technical Sciences / University of Warmia and Mazury in Olsztyn; 2023, 26(1); 125-140
1505-4675
2083-4527
Appears in:: Technical Sciences / University of Warmia and Mazury in Olsztyn
Content provider:: Biblioteka Nauki

Article


Title:: Current challenges and possible big data solutions for the use of web data as a source for official statistics
Współczesne wyzwania i możliwości w zakresie stosowania narzędzi big data do uzyskania danych webowych jako źródła dla statystyki publicznej
Authors:: Daas, Piet
Maślankowski, Jacek
Links:: https://bibliotekanauki.pl/articles/31232088.pdf
Publication date:: 2023-12-29
Publisher:: Główny Urząd Statystyczny
Subjects:: big data
web data
websites
web scraping
Description:: Web scraping has become popular in scientific research, especially in statistics. Preparing an appropriate IT environment for web scraping is currently not difficult and can be done relatively quickly. Extracting data in this way requires only basic IT skills. This has resulted in the increased use of this type of data, widely referred to as big data, in official statistics. Over the past decade, much work has been done in this area, both on the national level within the national statistical institutes and on the international level by Eurostat. The aim of this paper is to present and discuss current problems related to accessing, extracting, and using information from websites, along with suggested potential solutions. For the sake of the analysis, the paper presents a case study of large-scale web scraping performed in 2022 by means of big data tools. The results from the case study, conducted on a total population of approximately 503,700 websites, demonstrate that it is not possible to provide reliable data on the basis of such a large sample, as typically up to 20% of the websites might not be accessible at the time of the survey. What is more, it is not possible to know the exact number of active websites in particular countries, due to the dynamic nature of the Internet, which causes websites to change continuously.
Source:: Wiadomości Statystyczne. The Polish Statistician; 2023, 68, 12; 49-64
0043-518X
Appears in:: Wiadomości Statystyczne. The Polish Statistician
Content provider:: Biblioteka Nauki

Article


Title:: The use of web-scraped data to analyze the dynamics of footwear prices
Authors:: Juszczak, Adam
Links:: https://bibliotekanauki.pl/articles/2027264.pdf
Publication date:: 2021
Publisher:: Uniwersytet Ekonomiczny w Katowicach
Subjects:: Big data
Consumer Price Index
Inflation
Online shopping
Web-scraping
Description:: Aim/purpose: Web-scraping is a technique used to automatically extract data from websites. With the rise of online shopping, it allows the acquisition of information about prices of goods sold by retailers such as supermarkets or internet shops. This study examines the possibility of using web-scraped data from one clothing store. It aims at comparing known price index formulas applied to the web-scraping case and verifying their sensitivity to the choice of data filter type. Design/methodology/approach: The author uses price data scraped from one of the biggest online shops in Poland. The data were obtained as part of the eCPI (electronic Consumer Price Index) project conducted by the National Bank of Poland. The author selected three types of products for this analysis (female ballerinas, male shoes, and male oxfords) to compare their prices over a period of more than one year. Six price indexes were used for calculation: the Jevons and Dutot indexes with their chain and GEKS (an acronym from the names of its creators, Gini–Éltető–Köves–Szulc) versions. Apart from the analysis conducted on the full data set, the author introduced filters to remove outliers. Findings: Clothing and footwear are considered one of the most difficult groups of goods for which to measure price change indexes due to high product churn, which undermines the possibility of using the traditional Jevons and Dutot indexes. However, it is possible to use chained indexes and GEKS indexes instead. Still, these indexes are fairly sensitive to large price changes. As observed in the case of both product groups, the results provided by the GEKS and chained versions of the indexes were different, which could lead to the conclusion that even though they are yielding promising results, they could be better suited for other COICOP (Classification of Individual Consumption by Purpose) groups. Research implications/limitations: The findings of the paper showed that the use of filters did not significantly reduce the difference between price indexes based on GEKS and chain formulas. Originality/value/contribution: The use of web-scraped data is a fairly new topic in the literature. Research on the possibility of using different price indexes provides useful insights for future use of these data by statistics offices.
Source:: Journal of Economics and Management; 2021, 43; 251-269
1732-1948
Appears in:: Journal of Economics and Management
Content provider:: Biblioteka Nauki

Article
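
The Jevons and Dutot indexes named in the description above can be illustrated with a minimal sketch. It assumes prices are held in one dictionary per period, keyed by a product identifier, and that only products present in both compared periods are used; the function names, the sample data and the matching rule are illustrative assumptions, not the author's implementation.

```python
from math import exp, log


def matched(p0, pt):
    """Product ids priced in both periods (assumption: ids are stable)."""
    common = sorted(set(p0) & set(pt))
    if not common:
        raise ValueError("no matched products between the two periods")
    return common


def jevons(p0, pt):
    """Jevons index: unweighted geometric mean of price relatives."""
    common = matched(p0, pt)
    return exp(sum(log(pt[k] / p0[k]) for k in common) / len(common))


def dutot(p0, pt):
    """Dutot index: ratio of arithmetic mean prices (equal to ratio of sums)."""
    common = matched(p0, pt)
    return sum(pt[k] for k in common) / sum(p0[k] for k in common)


def chain(index, periods):
    """Chain index: product of consecutive period-on-period links."""
    result = 1.0
    for prev, cur in zip(periods, periods[1:]):
        result *= index(prev, cur)
    return result


# Hypothetical prices with product churn: "a" disappears, "c" appears.
prices = [
    {"a": 100.0, "b": 200.0},
    {"a": 110.0, "b": 200.0, "c": 50.0},
    {"b": 180.0, "c": 55.0},
]

print(jevons(prices[0], prices[1]))   # bilateral Jevons, period 0 to 1
print(dutot(prices[0], prices[1]))    # bilateral Dutot, period 0 to 1
print(chain(jevons, prices))          # chained Jevons, period 0 to 2
print(jevons(prices[0], prices[2]))   # direct Jevons uses only product "b"
```

With churn, the direct comparison of the first and last period can only use the single product that survives, while the chained version uses every consecutive match, which is why the two figures differ in this toy data.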


Title:: The use of web-scraped data to analyse the dynamics of clothing and footwear prices
Wykorzystanie danych scrapowanych do analizy dynamiki cen odzieży i obuwia
Authors:: Juszczak, Adam
Links:: https://bibliotekanauki.pl/articles/28408209.pdf
Publication date:: 2023-09-29
Publisher:: Główny Urząd Statystyczny
Subjects:: inflation
web scraping
online shopping
GEKS-J
Description:: Web scraping is a technique that makes it possible to obtain information from websites automatically. As online shopping grows in popularity, it has become an abundant source of information on the prices of goods sold by retailers. In addition to significantly reducing the costs of price research, the use of scraped data usually improves the precision of inflation estimates and allows real-time tracking. For this reason, web scraping is a popular research tool both for statistical centres (Eurostat, the British Office for National Statistics, the Belgian Statbel) and universities (e.g. the Billion Prices Project conducted at the Massachusetts Institute of Technology). However, the use of scraped data to calculate inflation brings about many challenges at the stages of their collection, processing, and aggregation. The aim of the study is to compare various methods of calculating price indices of clothing and footwear on the basis of scraped data. Using data from one of the largest online stores selling clothing and footwear for the period from February 2018 to November 2019, the author compared the results of the Jevons chain index, the GEKS-J index, and the GEKS-J index with the window expansion and window updating methods. The calculations confirmed a high chain index drift, and very similar results were found for the window expansion and window updating methods (excluding the FBEW method).
Source:: Wiadomości Statystyczne. The Polish Statistician; 2023, 68, 9; 15-33
0043-518X
Appears in:: Wiadomości Statystyczne. The Polish Statistician
Content provider:: Biblioteka Nauki

Article
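
The GEKS-J index mentioned in the description above is commonly defined as the geometric mean, over every link period l in the window, of the product of the bilateral Jevons indexes from the base period to l and from l to the target period. The sketch below follows that textbook definition under stated assumptions: prices are one dictionary per period keyed by product id, every pair of periods shares at least one product, and the function names and sample data are hypothetical. The window expansion and updating methods compared in the article are not shown.

```python
from math import exp, log


def jevons(p0, pt):
    """Bilateral Jevons index on products priced in both periods."""
    common = sorted(set(p0) & set(pt))
    if not common:
        raise ValueError("no matched products between the two periods")
    return exp(sum(log(pt[k] / p0[k]) for k in common) / len(common))


def geks_j(periods, t, base=0):
    """GEKS-J index from period `base` to period `t` over the full window.

    Geometric mean over all link periods l of
    Jevons(base, l) * Jevons(l, t).
    """
    n = len(periods)
    log_sum = 0.0
    for link in periods:
        log_sum += log(jevons(periods[base], link))
        log_sum += log(jevons(link, periods[t]))
    return exp(log_sum / n)


# Hypothetical window with a fixed set of products (no churn).
window = [
    {"a": 100.0, "b": 200.0},
    {"a": 110.0, "b": 190.0},
    {"a": 120.0, "b": 210.0},
]

print(geks_j(window, 2))              # multilateral index, period 0 to 2
print(jevons(window[0], window[2]))   # direct bilateral index, period 0 to 2
```

On a fixed product set the Jevons index is transitive, so the two printed values coincide; the multilateral index only departs from the bilateral one when products enter and leave the sample.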


Title:: Applying intelligent techniques for talent recruitment
Authors:: Flores-Hernández, Isaac
Bonilla-Huerta, Edmundo
Quintero-Flores, Perfecto Malaquías
Ponce, Oscar Atriano
Hernández-Hernández, José Crispín
Links:: https://bibliotekanauki.pl/articles/117942.pdf
Publication date:: 2019
Publisher:: Polskie Towarzystwo Promocji Wiedzy
Subjects:: fuzzy logic
web scraping
personnel selection
AHP
Description:: The objective of this research is to describe a system that aligns the hard and soft skills of an applicant with the current labor market. For this purpose, a system was implemented that uses web scraping to obtain a general profile of an area, while the applicant's soft skills are evaluated with the Cleaver test and the hard skills with a fuzzy inference system. The data are then entered into an Analytic Hierarchy Process, which lets the applicant see which area to improve according to their hard and soft skills.
Source:: Applied Computer Science; 2019, 15, 2; 63-72
1895-3735
Appears in:: Applied Computer Science
Content provider:: Biblioteka Nauki

Article


Title:: What Drives Price Dispersion in the European E-commerce Industry?
Authors:: Gyódi, Kristóf
Sobolewski, Maciej
Ziembiński, Michał
Links:: https://bibliotekanauki.pl/articles/1357380.pdf
Publication date:: 2018-12-18
Publisher:: Uniwersytet Warszawski. Wydział Nauk Ekonomicznych
Subjects:: price dispersion
price discrimination
e-commerce
web-scraping
price comparison service
Description:: An important aspect of the economic integration of the European Union is price convergence on the digital single market. In this study, we propose a novel way to measure price dispersion in the e-commerce industry, using a custom-made web-scraping tool. We target all the major price comparison sites in the 26 EU member states, which enables us to collect price signals from thousands of retail shops operating on-line. We analyse pricing data of 182 branded products sold on-line across the EU, representing the most popular categories: fashion, consumer electronics, gaming and software, and cosmetics. We find considerable dispersion of both pre- and post-VAT on-line prices, ranging from 20% to 40% depending on the product category. The observed on-line price dispersion is driven by both cost factors and the level of per capita income, which is consistent with the view that producers or large distributors might engage in strategic price discrimination induced by income heterogeneity.
Source:: Central European Economic Journal; 2017, 3, 50; 53-71
2543-6821
Appears in:: Central European Economic Journal
Content provider:: Biblioteka Nauki

Article


Title:: Web data scraping for digital public relations analysis based on the example of companies installing photovoltaic systems
Authors:: Zdonek, Dariusz
Links:: https://bibliotekanauki.pl/articles/27313485.pdf
Publication date:: 2022
Publisher:: Politechnika Śląska. Wydawnictwo Politechniki Śląskiej
Subjects:: digital public relations
Poland
cities
photovoltaics
web scraping
Description:: Purpose: The first objective of this article was to attempt to identify the major differences between such terms as public relations (PR), digital public relations (DPR) and digital marketing (DM). The second objective was to employ selected web data scraping techniques to analyse the DPR of service providers installing photovoltaic systems. Design/methodology/approach: The first objective of this article was achieved by analysing reference works. To achieve the second objective, the author used MS Excel, web scraping and proprietary computer scripts in R and Python. In this way, selected details were obtained from the company catalogue at panoramafirm.pl and from the Google search engine, and then the results were compared and analysed. What is more, the results from the Google search engine were obtained and analysed for 964 towns and cities entered in the engine together with the “photovoltaics” phrase. Findings: 50 thousand URLs were obtained and 1,755 unique website domain addresses were extracted. By analysing the content of the websites at the obtained Internet domains, 6 major categories of websites were identified which appeared in the first 10 search results for the photovoltaic-related queries. These are: Company Websites (CW), Blog Websites (BW), Announcement Services (AS), SEO Landing Pages (SLP), Public Announcement Pages (PAP) and Social Media Pages (SMP). Each of these categories is characterised briefly and a few examples are provided for each of them. Research limitations/implications: The limitations of this article include the focus on one company catalogue, i.e., panoramafirm.pl, and on results from the Google search engine solely for the Polish language. Moreover, only the first 10 links returned by the Google engine for the single “photovoltaics” phrase and town/city name were taken into consideration. Originality/value: This article has a theoretical and practical value. The analysis made it possible to identify six categories of websites which may be analysed with respect to digital public relations in the area of photovoltaic system installation. The most important of them are the websites belonging to the Company Website (CW) and Social Media Page (SMP) types. This article is addressed to anyone interested in obtaining data from the Internet using the web scraping technique and in data analysis in the area of digital public relations (DPR).
Source:: Zeszyty Naukowe. Organizacja i Zarządzanie / Politechnika Śląska; 2022, 161; 365-380
1641-3466
Appears in:: Zeszyty Naukowe. Organizacja i Zarządzanie / Politechnika Śląska
Content provider:: Biblioteka Nauki

Article


Title:: The evaluation of (big) data integration methods in tourism
Ocena metod integracji danych dotyczących turystyki z uwzględnieniem big data
Authors:: Cierpiał-Wolan, Marek
Stateva, Galya
Links:: https://bibliotekanauki.pl/articles/31232009.pdf
Publication date:: 2023-12-29
Publisher:: Główny Urząd Statystyczny
Subjects:: data integration methods
tourism survey frame
web scraping
Description:: In view of the many dynamic changes taking place in the modern world due to the COVID-19 pandemic, the migration crisis, armed conflicts, etc., it is a huge challenge for official statistics to provide good-quality information, which should be available almost in real time. In this context, integration of data from multiple sources, in particular big data, is a prerequisite. The aim of the article is to characterise and evaluate the following selected methods of data integration in tourism statistics: Natural Language Processing (NLP), a machine learning algorithm, i.e. K-Nearest Neighbours (K-NN) using TF-IDF and N-gram techniques, and Fuzzy Matching, which belongs to the probabilistic methods. In tourism surveys, data acquired using web scraping deserve special attention. For this reason, the analysed methods were used to combine data from booking portals (Booking.com, Hotels.com and Airbnb.com) with a tourism survey frame. The data covered Poland and Bulgaria and were collected between April and July 2023. An attempt was also made to answer the question of how the data obtained from web scraping of tourism portals improved the quality of the frame. The study showed that Fuzzy Matching based on the Levenshtein algorithm combined with Vincenty’s formula was the most effective among all tested methods. In addition, as a result of data integration, it was possible to significantly improve the quality of the tourism survey frame in 2023 (an increase in the number of new accommodation establishments in Poland by 1.1% and in Bulgaria by 1.4%).
Source:: Wiadomości Statystyczne. The Polish Statistician; 2023, 68, 12; 25-48
0043-518X
Appears in:: Wiadomości Statystyczne. The Polish Statistician
Content provider:: Biblioteka Nauki

Article
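
The idea of pairing records by name similarity and geographic closeness, as described above, can be sketched as follows. This is only an illustration under explicit assumptions: the record layout (`name`, `lat`, `lon`), both thresholds and all function names are hypothetical, and the haversine great-circle distance is swapped in for Vincenty's formula to keep the example short, so it is not the authors' procedure.

```python
from math import asin, cos, radians, sin, sqrt


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1] after lower-casing and trimming."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (stand-in for Vincenty's formula)."""
    r = 6371000.0  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(h))


def is_match(rec_a, rec_b, min_similarity=0.8, max_distance_m=200.0):
    """Hypothetical rule: names must be similar AND locations close."""
    close = haversine_m(rec_a["lat"], rec_a["lon"],
                        rec_b["lat"], rec_b["lon"]) <= max_distance_m
    return close and similarity(rec_a["name"], rec_b["name"]) >= min_similarity


# Made-up records: one from a survey frame, two from a booking portal.
frame = {"name": "Hotel Polonia", "lat": 50.0, "lon": 20.0}
nearby = {"name": "Hotel Polonnia", "lat": 50.0005, "lon": 20.0}
far = {"name": "Hotel Polonia", "lat": 51.0, "lon": 20.0}

print(is_match(frame, nearby))  # similar name, tens of metres apart
print(is_match(frame, far))     # same name, but about 111 km away
```

Combining the two criteria guards against linking identically named establishments in different towns, which a name-only comparison would accept.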
