urlcounter
By Chris Lindgren chris.a.lindgren@gmail.com Distributed under the BSD 3-clause license. See LICENSE.txt or http://opensource.org/licenses/BSD-3-Clause for details.
Overview
urlcounter
is a set of functions that tallies full and domain URLs for periodic, event-defined social-media posting data. It assumes you seek an answer to the following questions about link-sharing:
- What are the top x full URLs and domain URLs from each group during each period?
- What are the top x full URLs and domain URLs from each group-module (detected community) in each period?
To use the module, import and follow the below example for guidance:
import urlcounter as urlc
dict_url_counts = urlc.top_urls(
df=cdf, #DataFrame of full corpus
periods=(1,10), #Tuple providing range of numbered periods
hubs=(1,10), #Tuple providing range of numbered hubs
period_dates=period_dates, #Dict of Lists with dates per period
list_of_regex=[htg_btw,htg_fbt,htg_anti], #List of regex patterns defined for each group
hl=hub_lists, #Dict with keyed lists of hub usernames per period
columns=['cleaned_urls', 'retweets_count', 'hashtags', 'username', 'mentions'], #Provide a List of column names to use for search and counting
url_sample_size=50, #Desired sample size limit, e.g., Top 50
verbose=True #Boolean. True prints out status messages, False prints nothing
)
Example outputs
It returns a Dict
keyed by user-defined group names, period ranges, and module ranges:
# Overall period-based URL summary data with keyed group name, 'fbt'
## '1' = Period 1
## 'fbt' = Keyed group name + 'urls_per_period' and 'domains_per_period' = Summary total data
output['1']['fbt_urls_per_period']
output['1']['fbt_domains_per_period']
# Overall community hub-based URL summary per period data with keyed group name, 'fbt'
## '1' = Period 1
## 'fbt' = Keyed group name 'fbt'
## '1' = Community hub/module 1
## 'hub_sample_size', 'hub_tweet_sample_size','hub_url_counts','hub_domain_counts' = Summary total data
output['1']['fbt']['1']['hub_sample_size']
output['1']['fbt']['1']['hub_tweet_sample_size']
output['1']['fbt']['1']['hub_url_counts']
output['1']['fbt']['1']['hub_domain_counts']
{'1': #Start period 1
'fbt_domains_per_period': [ #start period 1 totals for group keyed as 'fbt'
('twitter.com', 3003), ('instagram.com', 1001), ('facebook.com', 202)
],
'fbt_urls_per_period': [
('https://twitter.com/user/status/example', 202),
('https://www.instagram.com/p/example/', 202),
...
]}, #end period 1 totals for group keyed as 'fbt'
{'fbt': { #start period 1, module/hub 1
'1': {
'hub_domain_counts': [
('example.com', 178),
('example2.go.lc', 14),
('example3.com', 10),...
],
'hub_sample_size': 103,
'hub_tweet_sample_size': 486,
'hub_url_counts': [
('https://example.com/politics/story-title-1/',120),
('https://example.com/politics/story-title-2/',58),
...
]
}
}
}, #end period 1, module/hub 1
...
}, #end period
...
top_urls()
Tallies up URLs in corpus.
Arguments:
df
= DataFrame. Corpus to query from.columns
= a List of 5 column names (String) to reference in DF corpus. !IMP: The order matters:- Column with URLs (String) that includes a list of URLs included in post/content:
- Example: ['https://time.com','https://and-time-again.com']. The List can also be a String, '[]' since the function converts literals.
- Column with number of times a post was shared (Integer), such as Retweets on Twitter.
- Column with group data (String), such as hashtags from tweets.
- Column with usernames (String), such as tweet usernames
- Column with target content data (String), such as tweets with targeted users from module, or stringified list of targeted people like tweet mentions.
- Column with URLs (String) that includes a list of URLs included in post/content:
url_sample_size
= Integer. Desired sample limit.periods
= Tuple. Contains 2 Integers, which define the range of periods, e.g., (1,10)hubs
= Tuple. Contains 2 Integers, which define the range of module/hubs, e.g., (1,10)period_dates
= Dict of Lists with dates per period: pd['1'] => ['2018-01-01','2018-01-01',...]list_of_regex
= List. Contains:- list of regex patterns with group identifiers, such as hashtags
- String. Key identifier for group.
hl
= Dict. Contains lists of community-detected usernamesverbose
= Boolean. True prints out status messages (recommended), False prints nothing
Returns:
- Dict. See documentation for output details for data access.
url_counter()
Helper function for top_urls()
. It transforms an incoming list of Strings into a regex string to facilitate a search.
Arguments:
df
: DataFrame. Array of Strings to write as a regex String.columns
: A List of 4 column names to use from corpus, but only uses the first two in this function:- Name of URL column that includes a list of URLs included in post/content.
- Integer. Number of times a post was shared, such as Retweets on Twitter.
Returns:
- A
List
that includes: sorted_totals
: List of Tuples that contain 2 items:- String full URL
- Integer. Total number of URL instances (including RTs).
sorted_domain_totals
:- String domain URL
- Integer. Total number of URL instances (including RTs).
regex_lister()
Helper function for top_urls()
, but also can be used to create the group regex search parameters on its own. It transforms an incoming list of Strings into a regex string to facilitate a search.
Arguments:
the_list
: List. Array of Strings to write as a regex String.key
: String. Denotes the group name
Returns:
keyed
: Tuple with;'key'
(String) that denotes the group name'listicle'
(regex String) that will be used for a search
urlcounter
functions only with Python 3.x and is not backwards-compatible (although one could probably branch off a 2.x port with minimal effort).
Warning: urlcounter
performs no custom error-handling, so make sure your inputs are formatted properly! If you have questions, please let me know via email.
System requirements
- pandas
Installation
pip install urlcounter
Distribution update terminal commands
# Create new distribution of code for archiving sudo python3 setup.py sdist bdist_wheel # Distribute to Python Package Index python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/*