PYTHON > equivalents of the Linux commands (wget - grep - sed)
- try sudo apt-get install python3-wget.
- bitbucket.org/techtonik/python-wget
- This does none of the things OP asked for (and several things they didn’t ask for).
- The example tries to show wget's multi-download feature.
- No one asked for that. OP asked for the equivalent of -c, --read-timeout=5, and --tries=0 (with a single URL).
- I’m really glad to see it here, serendipity being the cornerstone of the internet. I might add, while I’m here, that during my research I came across requests-threads for multithreading with the requests library: github.com/requests/requests-threads
- This has two downvotes and I have no idea why. Anyone who downvoted want to leave a comment? I know you could add regex compilation etc., but I thought that would detract from the clarity of the answer. I don’t think there is anything incorrect, and I’ve run the code, unlike some of the other answers.
- This answer was perfect for me, thanks. Just another quick question: how would I print if no matches were found?
- "you should compile your regex before using the loops.", No, Python will compile and cache it on its own, it’s a common myth, it’s a nice thing to do for readability reasons, htough.
- The reasonable answer to the natural question is "Because the code is part of a much larger Python script, and who wants to call out to grep in such a case?" In short, I’m glad this question is here because I’m replacing a bash script with a Python script that is hopefully easier on the system.
- use sys.argv to get the command-line parameters
- use open() and read() to manipulate files
- use the Python re module to match lines (a minimal sketch follows)
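Put together, that recipe gives something like this minimal sketch (the script name is a placeholder; the pattern and filename come from the command line):

import re
import sys

# usage: pygrep.py PATTERN FILE
pattern, path = sys.argv[1], sys.argv[2]
with open(path) as f:
    for line in f:
        if re.search(pattern, line):
            print(line, end="")  # the line already ends with a newline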
wget
urllib.request should work. Just set it up in a while(not done) loop, check if a localfile already exists, if it does send a GET with a RANGE header, specifying how far you got in downloading the localfile. Be sure to use read() to append to the localfile until an error occurs.
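A rough sketch of that loop (assuming the server honors Range requests; the URL and filename are placeholders, and real code should also handle the 416 response a server returns once the file is complete):

import os
import urllib.request

url = "http://example.com/bigfile.bin"  # placeholder
local = "bigfile.bin"                   # placeholder

done = False
while not done:
    # Resume from wherever the local file left off.
    start = os.path.getsize(local) if os.path.exists(local) else 0
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-" % start})
    try:
        response = urllib.request.urlopen(req, timeout=5)
        with open(local, "ab") as f:
            while True:
                chunk = response.read(8192)
                if not chunk:  # EOF: the download is complete
                    done = True
                    break
                f.write(chunk)
    except OSError:
        pass  # connection dropped or timed out; loop around and resume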
This is also potentially a duplicate of Python urllib2 resume download doesn’t work when network reconnects
When I try urllib.request.urlopen or urllib.request.Request with a string containing the url as the url argument, I get ValueError: unknown url type
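That ValueError usually means the URL string is missing its scheme; urllib only accepts fully qualified URLs:

import urllib.request

# urllib.request.urlopen("www.example.com")                  # ValueError: unknown url type
response = urllib.request.urlopen("http://www.example.com")  # fine: scheme included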
There is also a nice Python module named wget that is pretty easy to use. Found here. This demonstrates the simplicity of the design:
>>> import wget
>>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
>>> filename = wget.download(url)
100% [................................................] 3841532 / 3841532
>>> filename
'razorback.mp3'
However, if wget doesn’t work (I’ve had trouble with certain PDF files), try this solution.
Edit: You can also use the out parameter to write to a custom output directory instead of the current working directory.
>>> output_directory = <directory_name>
>>> filename = wget.download(url, out=output_directory)
>>> filename
'razorback.mp3'
Sorry for late reply, didn’t see this notification for some reason. You need to pip install wget most likely.
import urllib2

attempts = 0
while attempts < 3:
    try:
        response = urllib2.urlopen("http://example.com", timeout=5)
        content = response.read()
        f = open("local/index.html", 'w')
        f.write(content)
        f.close()
        break
    except urllib2.URLError as e:
        attempts += 1
        print type(e)
I had to do something like this on a version of Linux that didn’t have the right options compiled into wget. This example downloads the memory analysis tool ‘guppy’. I’m not sure if it’s important or not, but I kept the target file’s name the same as the URL target name…
Here’s what I came up with:
python -c "import requests; r = requests.get('https://pypi.python.org/packages/source/g/guppy/guppy-0.1.10.tar.gz') ; open('guppy-0.1.10.tar.gz' , 'wb').write(r.content)"
That’s the one-liner; here it is a little more readable:
import requests

fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url)
open(fname, 'wb').write(r.content)
This worked for downloading a tarball. I was able to extract the package and use it after downloading.
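For reference, the extraction step can also be done from Python, with the standard tarfile module (a sketch, reusing the filename above):

import tarfile

# Extract the tarball downloaded above into the current directory.
with tarfile.open('guppy-0.1.10.tar.gz', 'r:gz') as tar:
    tar.extractall()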
To address a question, here is an implementation with a progress bar printed to STDOUT. There is probably a more portable way to do this without the clint package, but this was tested on my machine and works fine:
#!/usr/bin/env python
from clint.textui import progress
import requests

fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url, stream=True)
with open(fname, 'wb') as f:
    total_length = int(r.headers.get('content-length'))
    for chunk in progress.bar(r.iter_content(chunk_size=1024),
                              expected_size=(total_length / 1024) + 1):
        if chunk:
            f.write(chunk)
            f.flush()
Any way to show progress of file downloading?
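As noted above, there is probably a more portable way to do this without the clint package; here is one minimal sketch that just prints a bare percentage to STDOUT:

import sys
import requests

fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url, stream=True)
total = int(r.headers.get('content-length', 0))
done = 0
with open(fname, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
            done += len(chunk)
            if total:  # some servers omit content-length; skip the display then
                sys.stdout.write('\r%3d%%' % (100 * done // total))
                sys.stdout.flush()
sys.stdout.write('\n')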
easy as py:
class Downloder():
    def download_manager(self, url, destination='Files/DownloderApp/', try_number="10", time_out="60"):
        # threading.Thread(target=self._wget_dl, args=(url, destination, try_number, time_out)).start()
        if self._wget_dl(url, destination, try_number, time_out) == 0:
            return True
        else:
            return False

    def _wget_dl(self, url, destination, try_number, time_out):
        import subprocess
        # wget flags: -c resume, -P destination directory, -t retries, -T timeout
        command = ["wget", "-c", "-P", destination, "-t", try_number, "-T", time_out, url]
        download_state = 1  # non-zero by default, in case the call itself fails
        try:
            download_state = subprocess.call(command)
        except Exception as e:
            print(e)
        # a download_state of 0 means a successful download
        return download_state
FYI: This won’t work on Windows, as the wget command isn’t implemented there.
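Hypothetical usage (the URL and destination are placeholders, and wget must be on the PATH):

ok = Downloder().download_manager("http://example.com/file.iso", destination="downloads/")
print("Downloaded!" if ok else "Download failed.")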
Let me improve the example with threads, in case you want to download many files.
import math
import random
import threading

import requests
from clint.textui import progress

# You must define a proxy list
# I suggest https://free-proxy-list.net/
proxies = {
    0: {'http': 'http://34.208.47.183:80'},
    1: {'http': 'http://40.69.191.149:3128'},
    2: {'http': 'http://104.154.205.214:1080'},
    3: {'http': 'http://52.11.190.64:3128'}
}

# You must define the list of files you want to download
videos = [
    "https://i.stack.imgur.com/g2BHi.jpg",
    "https://i.stack.imgur.com/NURaP.jpg"
]

downloaderses = list()

def downloaders(video, selected_proxy):
    print("Downloading file named {} by proxy {}...".format(video, selected_proxy))
    r = requests.get(video, stream=True, proxies=selected_proxy)
    nombre_video = video.split("/")[3]
    with open(nombre_video, 'wb') as f:
        total_length = int(r.headers.get('content-length'))
        for chunk in progress.bar(r.iter_content(chunk_size=1024),
                                  expected_size=(total_length / 1024) + 1):
            if chunk:
                f.write(chunk)
                f.flush()

for video in videos:
    selected_proxy = proxies[math.floor(random.random() * len(proxies))]
    t = threading.Thread(target=downloaders, args=(video, selected_proxy))
    downloaderses.append(t)

for _downloaders in downloaderses:
    _downloaders.start()
A solution that I often find simpler and more robust is to simply execute a terminal command from within Python. In your case:
import os

url = 'https://www.someurl.com'
os.system(f'wget -c --read-timeout=5 --tries=0 "{url}"')
—
grep
import re
import sys

file = open(sys.argv[2], "r")
for line in file:
    if re.search(sys.argv[1], line):
        print line,
search instead of match, to find the pattern anywhere in the string
the comma (,) after print suppresses the trailing newline (the line already has one)
argv includes the Python file name, so the arguments start at index 1
This doesn’t handle multiple arguments (like grep does) or expand wildcards (like the Unix shell would). If you wanted this functionality you could get it using the following:
import re
import sys
import glob

for arg in sys.argv[2:]:
    for file in glob.iglob(arg):
        for line in open(file, 'r'):
            if re.search(sys.argv[1], line):
                print line,
you should compile your regex before using the loops.
Concise and memory efficient:
#!/usr/bin/env python
# file: grep.py
import re, sys

map(sys.stdout.write, (l for l in sys.stdin if re.search(sys.argv[1], l)))
It works like egrep (without too much error handling), e.g.:
cat file-to-be-searched | grep.py "RE"
And here is the one-liner:
cat file-to-be-searched | python -c "import re,sys;map(sys.stdout.write,(l for l in sys.stdin if re.search(sys.argv[1],l)))" "RE"
Adapted from a grep in python.
Accepts a list of filenames via [2:], does no exception handling:
#!/usr/bin/env python
import re, sys, os

for f in filter(os.path.isfile, sys.argv[2:]):
    for line in open(f).readlines():
        if re.match(sys.argv[1], line):
            print line
sys.argv[1] and sys.argv[2:] work as shown if you run the script as a standalone executable, i.e. after chmod +x.
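For example, assuming the script above is saved as grep.py (pattern and filenames are placeholders):

chmod +x grep.py
./grep.py "RE" file1.txt file2.txt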
what’s the difference between re.match and re.search?
see Nick Fortescue’s top answer: "search instead of match to find anywhere in string"
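A quick illustration of the difference:

import re

print(re.match("bar", "foobar"))   # None: match only anchors at the start of the string
print(re.search("bar", "foobar"))  # a match object: search scans the whole string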
You might be interested in pyp. Citing my other answer:
"The Pyed Piper", or pyp, is a linux command line text manipulation tool similar to awk or sed, but which uses standard python string and list methods as well as custom functions evolved to generate fast results in an intense production environment.
The real problem is that the variable line always has a value. The test for "no matches found" is whether there was a match, so the code "if line == None:" should be replaced with "else:".
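Concretely, one way to print a message when nothing matched (a sketch using a flag, in the same Python 2 style as the grep answer above):

import re, sys

found = False
for line in open(sys.argv[2]):
    if re.search(sys.argv[1], line):
        print line,
        found = True
if not found:
    print "no matches found"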
—
There is the pyp project, which we owe to Sony Pictures Imageworks, who needed to simplify the automation of build tasks for their films.
And there is pyped, which I briefly mentioned here (an article that deserves an update, since I have since replaced dateutils with arrow).
Both were nice, but they had convoluted syntaxes, weren’t Python 3 compatible and lacked auto-imports. However, pyped recently hit v1.0, so it’s stable, and it has a brand-new approach to syntax that makes the little beast a real pleasure to use.
Stdin, line by line
Installation is a simple pip:
pip install --user pyped
Behind that, you get the pyp command. It is mostly used at the end of another command. Typically:
cat /etc/fstab | pyp "something"
The trick is that "something" can be any Python expression. Generally an expression that print()s something.
Now, Pyped automatically makes two variables available to that expression:
- The current line, in the variable x.
- The current line number, in the variable i.
The Python expression is called once for each line.
For example, suppose I have a file “fortune.txt” containing:
bitcoin (btc) : 5
euros (€) : 100
dollars ($) : 80
If I want to uppercase everything, I do:
cat fortune.txt | pyp "print(x.upper())"
BITCOIN (BTC) : 5
EUROS (€) : 100
DOLLARS ($) : 80
You can chain several expressions in a row. So, if I only want to keep the amount and the symbol:
cat fortune.txt | pyp "devise, sign, _, value = x.split()" "sign = sign.strip('()')" "print('%s%s' % (value, sign))"
5btc
100€
80$
OK, it’s longer than perl, but much easier to write and to read back. And I’m using a language I already know. And no need for an incomprehensible mix of sed, awk, cut and friends.
If I really need readability, I can even spread it over several lines:
cat fortune.txt | pyp "
devise, sign, _, value = x.split()
sign = sign.strip('()')
print('%s%s' % (value, sign))
"
5btc
100€
80$
You will have noticed that I use print() and don’t seem to worry about Unicode. That’s because pyped does this at the start of the script:
from __future__ import print_function, unicode_literals, division, absolute_import
So we’re still on Python 2.7, but we get true division, the print function, absolute imports and, above all, Unicode everywhere. pyped even converts x for you so that it is a unicode object.
In any case, pyped is also Python 3 compatible.
Processing everything at once
Sometimes you need access to all the lines, not just one line at a time. pyped allows this with the -i option. The variables x and i disappear in favor of the variable l, which holds an iterable over all the lines.
For example, feel like sorting all that?
cat fortune.txt | pyp -i "
lignes = (x.split() for x in l)
lignes = sorted((v, s.strip('()')) for d, s, _, v in lignes)
for ligne in lignes:
    print('%s%s' % ligne)
"
100€
5btc
80$
Among the most useful features is the -b option, which runs a piece of code before the loop. Handy for importing things, like the tarfile module to extract an archive before working with its contents.
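Presumably something like this (the exact payload is my own guess from that description; tarballs.txt would list one archive path per line):

cat tarballs.txt | pyp -b "import tarfile" "tarfile.open(x).extractall()"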
That said, most of the time there is nothing to import, because pyped already auto-imports the most useful modules: math, datetime, re, json, hashlib, uuid, etc.
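For instance, re should be usable directly, without any import (an assumption based on that list):

cat fortune.txt | pyp "print(re.sub(r'\d+', 'N', x))"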
—