Python web scraper¶
- Author:
Scrapyer¶
This is a small utility that downloads zipped XML files from a repository and stores their contents in a database.
Install¶
git clone https://github.com/ambagasdowa/Scrapyer.git
python3 -m pip install ./Scrapyer
OR
python3 -m pip install git+https://github.com/ambagasdowa/Scrapyer.git
To install a specific branch or tag, append it with @:
python3 -m pip install git+https://github.com/ambagasdowa/Scrapyer.git@v3.1.4
To upgrade, uninstall the package and reinstall it from the desired branch:
python3.9 -m pip uninstall -y Scrapyer && python3.9 -m pip install --upgrade git+https://github.com/ambagasdowa/Scrapyer.git@SoapWs
Config¶
Place a config.ini file in your user configuration directory, at ~/.config/config.ini, and edit it according to your needs.
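As a rough illustration of how such a file could be read, the sketch below uses Python's standard `configparser`. The section and key names (`database`, `host`) are hypothetical placeholders, not Scrapyer's actual settings, and the path simply follows the location given above.

```python
import configparser
from pathlib import Path

# Path taken from the text above; adjust it if your file lives elsewhere.
CONFIG_PATH = Path.home() / ".config" / "config.ini"


def load_config(path=CONFIG_PATH):
    """Read an INI file and return it as a dict of dicts ({} if missing)."""
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips files that do not exist.
    parser.read(path, encoding="utf-8")
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    settings = load_config()
    # "database" and "host" are hypothetical names used only for illustration.
    host = settings.get("database", {}).get("host", "localhost")
    print(host)
```

Returning plain dictionaries keeps the rest of a program independent of `configparser`, and a missing file falls back to defaults instead of raising an error.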
Usage¶
Scrapyer --help
usage: Scrapyer [-h] [-d DATES] [-c] [-cc] [-x APPLICATION] [-m MODULES] [-v]
Small utility to download zipped xml files and parse contents to a db
optional arguments:
-h, --help show this help message and exit
-d DATES, --dates DATES
Set the date for process files inputs can be --dates={[date:range] , [date0,date1] or [date]} (default: yesterday)
-c, --config Takes dates from configuration values
-cc, --createConfig Create configuration files in ~/.config/Scrapyer/config.ini
-x APPLICATION, --application APPLICATION
Set the application to execute [cmex, michelin ]
-m MODULES, --modules MODULES
Set the modules to run for get data from api ws , use whith --application params, and get entry as module1,module2,...
-v, --debug Set the debug output to true
Example¶
dates: sets the date or dates of the files to process. The input can be --dates={[date:range], [date0,date1] or [date]} (default: yesterday).
Process a range of dates:
Scrapyer -x cmex --dates='2022-12-01:2022-12-15'
Read the dates from the configuration file:
Scrapyer --config -x cmex
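To make the three accepted `--dates` forms concrete, here is a minimal sketch of how such a value might be expanded into individual days. The function name `parse_dates` is hypothetical, and the inclusive handling of a `start:end` range is an assumption; Scrapyer's own parsing may differ.

```python
from datetime import date, timedelta


def parse_dates(value=None, today=None):
    """Expand a --dates style string into a list of date objects.

    Assumed formats: 'YYYY-MM-DD:YYYY-MM-DD' (inclusive range),
    'YYYY-MM-DD,YYYY-MM-DD,...' (explicit list) or a single 'YYYY-MM-DD'.
    With no value, yesterday is returned, matching the documented default.
    """
    today = today or date.today()
    if not value:
        return [today - timedelta(days=1)]
    if ":" in value:
        start_text, end_text = value.split(":", 1)
        start = date.fromisoformat(start_text.strip())
        end = date.fromisoformat(end_text.strip())
        if end < start:
            raise ValueError("range end is before range start")
        days = (end - start).days
        return [start + timedelta(days=offset) for offset in range(days + 1)]
    return [date.fromisoformat(part.strip()) for part in value.split(",")]


if __name__ == "__main__":
    for day in parse_dates("2022-12-01:2022-12-15"):
        print(day.isoformat())
```

Under these assumptions, the range from the example above expands to fifteen consecutive days, from 2022-12-01 through 2022-12-15.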
Set application¶
Scrapyer -x app
SOAP Response¶
[Screenshot: SOAP response]
Download zipped XML files for processing¶
[Screenshot: downloading zipped XML files]
Processing unzipped files¶
[Screenshot: processing unzipped files]
Verbose output¶
[Screenshot: verbose output]
Save and process XML responses¶
[Screenshot: saving and processing XML responses]
Data in database¶
[Screenshot: data in the database]
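The unzip, parse and store steps pictured above could be sketched roughly as follows. This is not Scrapyer's implementation: the table `xml_data`, the helper names and the use of SQLite are assumptions made so the example runs on its own, and a small in-memory ZIP stands in for a downloaded file.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET
import zipfile


def xml_records_from_zip(zip_bytes):
    """Yield (file name, root tag, attributes) for each XML file in a ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip anything that is not an XML file
            root = ET.fromstring(archive.read(name))
            yield name, root.tag, dict(root.attrib)


def store_records(connection, records):
    """Insert one row per XML root attribute into a hypothetical table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS xml_data "
        "(file_name TEXT, root_tag TEXT, attr TEXT, value TEXT)"
    )
    rows = [
        (name, tag, key, value)
        for name, tag, attrs in records
        for key, value in attrs.items()
    ]
    connection.executemany("INSERT INTO xml_data VALUES (?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)


def build_demo_zip():
    """Create a small in-memory ZIP so the example runs without a network."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("invoice.xml", '<Invoice folio="A1" total="100.00"/>')
        archive.writestr("readme.txt", "not xml")
    return buffer.getvalue()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    inserted = store_records(db, xml_records_from_zip(build_demo_zip()))
    print(inserted)
```

Working on bytes in memory avoids writing temporary files; a real run would pass the downloaded archive's content to `xml_records_from_zip` and use a connection to the target database.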
Notes¶
The Michelin web service returns datetimes in GMT (Greenwich Mean Time), so they need to be converted to local time.
| GMT | Local Time (Mexico) |
|---|---|
| GMT +00:00 | GMT -06:00 |
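-- SQL Server: shift a stored UTC datetime by the server's current UTC offset
SELECT DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), dbo.tbl.value_datetime) FROM dbo.tbl;
-- MySQL: convert a UTC datetime to GMT -06:00
SELECT CONVERT_TZ('2016-01-01 12:00:00','+00:00','-06:00');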
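The same conversion can be done in Python before the data reaches the database. The sketch below assumes the fixed -06:00 offset from the table above and a `YYYY-MM-DD HH:MM:SS` timestamp format; the function name `gmt_to_local` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset taken from the table above; it ignores any daylight saving rules.
MEXICO_OFFSET = timezone(timedelta(hours=-6))


def gmt_to_local(text, local_tz=MEXICO_OFFSET):
    """Parse a 'YYYY-MM-DD HH:MM:SS' GMT timestamp and return local time."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    # Mark the parsed value as UTC before converting to the local offset.
    aware_gmt = naive.replace(tzinfo=timezone.utc)
    return aware_gmt.astimezone(local_tz)


if __name__ == "__main__":
    print(gmt_to_local("2016-01-01 12:00:00").strftime("%Y-%m-%d %H:%M:%S"))
```

For the timestamp used in the MySQL example, this yields 2016-01-01 06:00:00 local time.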
Todo¶
TODO: Scrapyer Roadmap
Build functions to print messages and to hide them
Add an Addenda module
Add an option to trigger the last procedure in cmex
Automate the database installation
Add a REST API {fastapi}
Build a dashboard for data visualization { web , dash }